Abstract

We present adaptive on-line schemes for lossy encoding of individual sequences under
the conditions of the Wyner-Ziv (WZ) problem, i.e., the decoder has access to side
information whose statistical dependency on the source is known. Both the source sequence
and the side information consist of symbols taking on values in a finite alphabet
$\mathcal{X}$. In the first part of this article, a set of fixed-rate scalar source codes with zero delay
is presented. We propose a randomized on-line coding scheme, which achieves asymptotically
(and with high probability) the performance of the best source code in the set,
uniformly over all source sequences. The scheme uses the same rate and has zero delay.
We then present an efficient algorithm for implementing our on-line coding scheme in
the case of a relatively small set of encoders. We also present an efficient algorithm for
the case of a larger set of encoders with a structure, using the method of the weighted
graph and the Weight Pushing Algorithm (WPA). In the second part of this article, we
extend our results to the case of variable-rate coding. A set of variable-rate scalar source
codes is presented. We generalize the randomized on-line coding scheme to our case.
This time, the performance is measured by the Lagrangian Cost (LC), which is defined
as a weighted sum of the distortion and the length of the encoded sequence. We present
an efficient algorithm for implementing our on-line variable-rate coding scheme in the
case of a relatively small set of encoders. We then consider the special case of lossless
variable-rate coding. An on-line scheme which uses Huffman codes is presented. We
show that this scheme can be implemented efficiently using the same graphical methods
from the first part. Combining the results from former sections, we build a generalized
efficient algorithm for a structured set of variable-rate encoders. Finally, we show how
to generalize all the results to general distortion measures. The complexity of all the
algorithms is no more than linear in the sequence length.

Index Terms: side information, Wyner-Ziv problem, source coding, on-line
schemes, individual sequences, expert advice, exponential weighting

1 Introduction



Consider a communication system with the following components: an individual source 
sequence to be compressed, a discrete memoryless channel (DMC) with known statistics, 
a noiseless channel with rate constraint R, and a decoder. The encoder maps the source 
sequence, $x_1, x_2, \ldots, x_n$, into a sequence of channel symbols, $z_1, z_2, \ldots, z_n$, taking values in
$\{1, 2, \ldots, M\}$, $M = 2^R$, which is transmitted to the decoder via a noiseless channel. The
decoder, in addition to the encoded data arriving from the noiseless channel, has access
to a side information sequence, $y_1, y_2, \ldots, y_n$, which is the output of the DMC fed by
the source sequence. Using the compressed data, $z_1, z_2, \ldots, z_n$, and the side information,
the decoder produces a reconstructed sequence $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n$. The goal is to minimize
the distortion between the source and the reconstructed signal by optimally designing an 
encoder-decoder pair. This is a variation of the problem of rate-distortion coding with 
decoder side information, which is well known as the Wyner-Ziv (WZ) coding problem, 
first introduced in [6]. The case of scalar source codes for the WZ problem was handled 
in several papers, e.g. [7] and [8]. In contrast to our case, these schemes operate under 
specific assumptions of known source statistics. WZ coding of individual sequences was 
also considered, e.g. in [9] and [10], and existence of universal schemes was established. 
However, these schemes are based on block coding or DUDE implementation and assume 
the knowledge of the source and side information sequences in advance. Thus, they are 
irrelevant to the case of on-line encoding considered here. 

A coding scheme is said to have an overall delay of no more than $d$ if there exist non-negative
integers $d_1$ and $d_2$, with $d_1 + d_2 \le d$, such that each channel symbol at time $t$, $z_t$, depends
only on $x_1, \ldots, x_{t+d_1}$, and each reconstructed symbol $\hat{x}_t$ depends only on $z_1, \ldots, z_{t+d_2}$ and
$y_1, \ldots, y_{t+d_2}$. Weissman and Merhav [3], following Linder and Lugosi [2], constructed a
randomized limited delay lossy coding scheme for individual sequences using methods based 
on prediction theory. These schemes perform, for any given reference class of source codes, 
called experts, almost as well as the best source code in the set, for all individual sequences. 
The performance of the scheme is measured by the distortion redundancy, defined as the 
difference between the normalized cumulative distortion of the scheme and that of the best 
source code in the set, matched to the source sequence. The scheme is based on random 
choices of source codes from the set. The random choices are done according to exponential
weights assigned to each code. The weight of each source code, each time we choose a code,
depends on its past performance and has to be calculated. Thus, implementing this scheme
for a large set of source codes requires efficient methods, to prevent prohibitive complexity.
György, Linder and Lugosi offered efficient algorithms for implementing such a scheme for
sets of scalar quantizers [4], [5] without side information. Our main contribution in this
paper is to extend this scenario to include side information at the decoder, in the spirit of 
the WZ problem, for both the fixed rate case and the variable-rate case. 

In the first part of this paper, a fixed-rate, zero-delay adaptive coding scheme for indi- 
vidual sequences under the WZ conditions is presented. We define a set of scalar source 
codes for the WZ problem. Then, the scheme of [3] is extended for the WZ problem w.r.t. 
the Hamming distortion measure. For any given set of WZ source codes, this scheme per- 
forms asymptotically as well as the best source code in the set, for all source sequences. 
We then demonstrate efficient implementations of this scheme. First, it is shown that the 
scheme can be implemented efficiently for any relatively small set of encoders, even though 
the set of decoders is large. Then, using graph-theoretic methods similarly to [5], we show 
that we can implement the scheme for large sets of scalar encoders with a structure. 

In the second part of the paper, we extend the results of [3], and the coding schemes from 
the first part, to the variable-rate coding case. Without loss of generality, we assume that 
the noiseless channel is binary. The encoder, instead of using a fixed-rate code for encoding 
the source sequence into $M$ symbols, now uses a variable-length binary prefix code with $M$
codewords. The decoder, upon receiving the binary encoded sequence, first produces the 
indexes of the codewords transmitted, and then continues exactly as in the fixed-rate case. 
The prefix property enables instantaneous decoding of the codewords. 

We start by defining a set of variable-rate scalar source codes. Then, the scheme of [3] is 
generalized to the variable-rate case. The performance is now measured by the LC function, 
which is defined as a weighted sum of the distortion and the length of the binary encoded 
sequence. As before, for any given set of variable-rate source codes, this scheme performs 
asymptotically as well as the best source code in the set, for all source sequences. We then 
demonstrate efficient implementations of this scheme. Again, it is shown that the scheme 
can be implemented efficiently for any relatively small set of encoders, in a way similar to the 
fixed-rate case. Then, we handle the special case of lossless variable-rate coding. We first 
demonstrate a method of representing sets of Huffman codes on an acyclic directed graph.
Using this representation and the WPA, we present an efficient implementation for the lossless
case. Then, combining this result and the set of encoders with a structure from the fixed-
rate case, we show that we can implement the generalized variable-rate scheme for large 
sets of scalar encoders. Finally, all the implementations are generalized to accommodate 
any distortion measure, at the price of increased complexity. 

It should be pointed out that the development of our efficient on-line scheme in the
fixed-rate case is not a straightforward extension of those in [4], [5], for the following
reasons: (i) Due to the side information, the optimal partition of the source alphabet does
not necessarily correspond to intervals. (ii) The problem of choosing an expert is more
complicated in the WZ setting. In [4] and [5], which deal with quantizers, the problem
of choosing an expert is reduced to the problem of choosing decoding points. Given these
points, the encoder is chosen to be the nearest-neighbor encoder which uses these points.
In our case, this mechanism is irrelevant, and we have to choose the encoder and decoder
separately.

The remainder of the paper is organized as follows. In Section 2, a formal description 
of the fixed-rate case is given. In Subsection 2.1, we define the set of WZ scalar source 
codes. A general coding scheme, which achieves essentially the same performance as the 
best in a given set of WZ codes, is presented in Subsection 2.2. Section 3 is dedicated to 
the efficient implementation of this scheme for sets of scalar source codes. In Subsection 
3.1, we present an efficient implementation for large sets of encoders with structure, using 
graphical methods. In Section 4, we give a formal description of the problem for the 
variable-rate case. In Subsection 4.1 we define the set of variable-rate source codes. In 
Subsection 4.2, we generalize the scheme and results of section 2. In Section 5, we present 
an efficient implementation of this scheme for scalar source codes with variable-rate coding. 
In Subsection 5.1, we handle the special case of lossless variable-rate coding. We establish
an efficient scheme which achieves essentially the same compression as the best in a given set
of Huffman codes. In Subsection 5.2, we present an efficient implementation for the general
lossy coding scheme. In Subsection 5.3, we show how to generalize our results to any
bounded distortion measure. Finally, in Subsection 5.4, we describe the implementation of
our variable-rate coding scheme for the special case of quantizers which use Huffman codes.

2 Definition of an on-line Adaptive WZ Scheme

Throughout this paper, for any positive integer $n$, we let $a^n$ denote the sequence $a_1, a_2, \ldots, a_n$.
Given a source sequence $x^n$, the encoder maps the source sequence into a sequence $z^n$
whose symbols $\{z_i\}$ take on values in the set $\{1, 2, \ldots, M\}$. The decoder, in addition to
$z^n$, has access to a sequence $y^n$, dependent on $x^n$ via a known DMC, defined by the single-
letter transition probability $P_{Y|X}(y_i|x_i)$, which is the probability of $y_i$ given $x_i$. Based on
$z^n$ and $y^n$, the decoder produces the reconstructed sequence $\hat{x}^n$. For convenience, we assume
that $x_i$, $y_i$ and $\hat{x}_i$ all take on values in the same finite alphabet $\mathcal{X}$ with cardinality $|\mathcal{X}|$.
All the results can be generalized straightforwardly to the case of different alphabets. The
distortion between two symbols is defined to be the Hamming distortion:

$$\rho(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x} \\ 1 & \text{elsewhere} \end{cases} \qquad (1)$$

We define the distortion for the input symbol $x_t$ at time $t$ ($t = 0, 1, 2, \ldots$) as:

$$\Delta_t(x_t) = E\,\rho(x_t, \hat{x}_t(z^t, y^t)) \qquad (2)$$

where the expectation is taken with respect to $y^t$.

2.1 Definition of the reference set of source codes

In this part, we define a general set of scalar source codes, here referred to as experts.
Each expert is a source code with a fixed rate, $R = \log M$, which partitions $\mathcal{X}$ into $M$
disjoint subsets $(m_1, m_2, \ldots, m_M)$. The encoder $e$ for each expert is given by a function
$e : \mathcal{X} \to \{1, 2, \ldots, M\}$, that is, $z_i = e(x_i)$. The decoder $d$ receives $z_i$, together with the side
information $y_i$, and generates $\hat{x}_i$, using a decoding function $d : \{1, 2, \ldots, M\} \times \mathcal{X} \to \mathcal{X}$, i.e.,
$\hat{x}_i = d(z_i, y_i)$. The definition above is not complete. It is easy to see that different encoders
may actually implement the same partition. For example: if $\mathcal{X} = \{1, 2, 3\}$ and $M = 2$,
consider the two encoders:

$$e_1 : e_1(1) = 1,\ e_1(2) = 2,\ e_1(3) = 2$$
$$e_2 : e_2(1) = 2,\ e_2(2) = 1,\ e_2(3) = 1 \qquad (3)$$

It is easy to see that they have the same functionality, since they differ only in the labels
assigned to the two subsets. In our definition we treat these encoders as the same encoder.
Otherwise, the same expert would be taken into account several times. The number of times
depends on the specific partition, so we would get an unbalanced weighting of experts.
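
As an illustration of why relabelled encoders should be merged, the sketch below maps an encoder to the partition it induces and compares the two encoders of the example above. The helper name `canonical_partition` and the dictionary representation of an encoder are assumptions made only for this sketch; they do not appear in the paper.

```python
def canonical_partition(encoder, alphabet):
    """Return the partition induced by `encoder` as a frozenset of frozensets.

    `encoder` is a dict mapping each source letter to a channel symbol.
    Two encoders that differ only by relabelling the channel symbols
    produce the same value.
    """
    cells = {}
    for x in alphabet:
        cells.setdefault(encoder[x], set()).add(x)
    return frozenset(frozenset(cell) for cell in cells.values())


alphabet = [1, 2, 3]
e1 = {1: 1, 2: 2, 3: 2}  # e_1 from the example
e2 = {1: 2, 2: 1, 3: 1}  # e_2 from the example

# Both encoders split the alphabet into {1} and {2, 3}.
same_expert = canonical_partition(e1, alphabet) == canonical_partition(e2, alphabet)
```

Keeping one representative per induced partition is one simple way to avoid counting the same expert several times.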



2.1.1 Definition of the encoders using the partition matrix

To define an encoder uniquely, and to get bounds on the cardinality of the general set of
encoders, let us define the partition matrix:

$$PM_{j,l} = \begin{cases} 1 & \text{if } x_j \text{ and } x_l \text{ belong to the same subset} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $j, l \in \{1, 2, \ldots, |\mathcal{X}|\}$ are the indexes of the alphabet letters, $x_j, x_l \in \mathcal{X}$, given that
we ordered the alphabet in some arbitrary order.
The properties of $PM$:

1. $PM_{i,j} \in \{0, 1\}$.

2. If $i = j$ then $PM_{i,j} = 1$.

3. $PM$ is symmetric, i.e. $PM_{i,j} = PM_{j,i}$.

4. If $PM_{i,j} = 1$ and $PM_{j,k} = 1$ then $PM_{i,k} = 1$.

5. If $PM_{i,j} = 1$ and $PM_{i,k} = 0$ then $PM_{j,k} = 0$.

It is easy to see that each partition matrix (a matrix which has the above properties) defines
a unique partition of the alphabet and thus defines an encoder uniquely. Using the properties of
this matrix, we can derive bounds on the number of encoders:

$$2^{|\mathcal{X}|-1} \le \text{Number of } PM\text{'s} \le 2^{|\mathcal{X}| \log M} \qquad (5)$$

The lower bound is derived from the fact that the first row to be determined has $|\mathcal{X}| - 1$
degrees of freedom, i.e., it can be any binary vector of length $|\mathcal{X}|$ whose first element
is equal to 1. This reflects the fact that the choice of the first subset of letters is unrestricted.
The upper bound is derived from the fact that the number of encoders without the limitation
of counting every partition only once is $M^{|\mathcal{X}|}$. So the number of encoders is exponential
in $|\mathcal{X}|$. Therefore, using the general set of encoders is a challenge from a computational
complexity point of view.

2.1.2 Definition of the decoders

We limit our discussion to decoders which satisfy:

$$d(e(x), y) = \hat{x}, \quad \hat{x} \in m_{e(x)}, \quad \text{for all } x \text{ and } y \qquad (6)$$

which means that the decoded symbol $\hat{x}$ is chosen from the same subset as the input
symbol $x$. Using the Hamming distortion measure, it is easy to see that there is no point in
choosing $\hat{x}_i$ outside the subset $m_{z_i}$, hence this set of decoders is sufficient. For other distortion
measures, the results can be generalized straightforwardly to the set of all possible decoders.
From the above definition, we see that every encoder defines a set of possible decoders. This
set consists of all combinations of choices of $\hat{x}$ from the subset $m_z$, for different pairs $(z, y)$.
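
The partition matrix of Subsection 2.1.1 and the restricted decoder set above can be illustrated with a small sketch. It assumes that a matrix entry equals 1 exactly when two letters share a subset, and that every subset is non-empty. The function names and the dictionary representation are hypothetical and chosen only for this example.

```python
from itertools import product


def partition_matrix(encoder, alphabet):
    """Entry [j][l] is 1 when letters j and l are mapped to the same channel symbol."""
    return [[1 if encoder[a] == encoder[b] else 0 for b in alphabet] for a in alphabet]


def admissible_decoders(encoder, alphabet, M):
    """Yield every decoder d[(z, y)] whose output always lies in the subset m_z."""
    cells = {z: [x for x in alphabet if encoder[x] == z] for z in range(1, M + 1)}
    keys = [(z, y) for z in range(1, M + 1) for y in alphabet]
    # One independent choice from m_z for every pair (z, y).
    for choice in product(*(cells[z] for z, _ in keys)):
        yield dict(zip(keys, choice))


alphabet = [1, 2, 3]
encoder = {1: 1, 2: 2, 3: 2}  # subsets m_1 = {1}, m_2 = {2, 3}
pm = partition_matrix(encoder, alphabet)
decoders = list(admissible_decoders(encoder, alphabet, M=2))
```

For this encoder the enumeration yields (1 * 2)^3 = 8 decoders, one independent choice per pair (z, y). This exponential growth is why the later sections avoid enumerating decoders explicitly.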

2.1.3 The set of scalar source codes 

We define $\mathcal{F}_{WZ}(M)$ as the set of all scalar WZ source codes with rate $R = \log M$, i.e. all
the pairs that consist of a scalar encoder and one of its possible decoders, as defined in
Subsection 2.1.2.

Remark. In contrast to our case, when there is a known joint distribution $P(x_i, y_i)$, then
given the encoder and $y_i$, the best strategy for minimizing the Hamming distortion is, of
course, maximum likelihood, i.e., choose the most probable $x$ from the subset $m_{z_i}$, given $y_i$:

$$\hat{x} = \arg\max_{x \in m_{z_i}} P_{X|Y}(x|y_i) \qquad (7)$$

However, in our case, $P_{X|Y}(x|y)$ is unavailable since $P(x)$, the source statistics, is unknown
or non-existent. Therefore, knowing the encoder is not sufficient for determining the best
decoder.
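
For contrast with the individual-sequence setting, the following sketch shows the rule of the remark when the source statistics are available. The prior and the channel below are hypothetical numbers invented for the example, and the function name is an assumption.

```python
def best_decoder_output(cell, y, prior, channel):
    """Pick the x in `cell` maximising P(x) * P(y|x), which is proportional to P(x|y)."""
    return max(cell, key=lambda x: prior[x] * channel[x][y])


prior = {1: 0.5, 2: 0.3, 3: 0.2}  # hypothetical source statistics P(x)
channel = {                       # hypothetical DMC, channel[x][y] = P(y|x)
    1: {1: 0.8, 2: 0.1, 3: 0.1},
    2: {1: 0.1, 2: 0.8, 3: 0.1},
    3: {1: 0.1, 2: 0.1, 3: 0.8},
}
cell = [2, 3]  # the subset m_z announced by the encoder

choice_given_y2 = best_decoder_output(cell, 2, prior, channel)
choice_given_y3 = best_decoder_output(cell, 3, prior, channel)
```

Without `prior`, this rule cannot be evaluated, which is why the decoder has to be learned on-line together with the encoder.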

2.2 An on-line WZ coding scheme 

In this part, we describe an on-line adaptive scheme for the WZ case based on the results of
[3]. For any source sequence $x^n$, the distortion $\Delta_{(e,d)}(x^n)$ of a source code $(e, d)$ is defined
by:

$$\Delta_{(e,d)}(x^n) = \sum_{i=1}^{n} \Delta_i(x_i) \qquad (8)$$

where $\Delta_i(x_i)$ is as defined in (2). In the case of a scalar source code, we get:

$$\Delta_{(e,d)}(x^n) = \sum_{i=1}^{n} \sum_{y \in \mathcal{X}} P_{Y|X}(y|x_i)\, \rho(x_i, d(e(x_i), y)) \qquad (9)$$

Given any finite set of scalar source codes with rate $R$ and zero delay, this scheme (which
has the same rate $R$ and zero delay) achieves asymptotically the distortion $\Delta_{(e,d)}(x^n)$ of
the best source code in the set, for all source sequences $x^n$. To be more specific, we extend
[3, Theorem 1] to include side information at the decoder. We get that, for any bounded
distortion measure ($\rho(x, \hat{x}) \le B$, $\forall x, \hat{x} \in \mathcal{X}$, for some positive real number $B$), the following
result holds:

Theorem 1 Let $\mathcal{A}$ be a finite subset of $\mathcal{F}_{WZ}(M)$. Then there exists a sequential source
code $(e, d)$ with rate $R = \log M$ and zero delay such that for all $x^n \in \mathcal{X}^n$:

E{i[AJl_^-^(x") -mm(,,,,,)e^A-,^,)(x")]} < ^[log (lo) 

where the expectation is taken w.r.t. a certain randomization of the algorithm, which will
be described below. For the Hamming distortion measure, we have $B = 1$. The proof is
similar to the proof of [3, Theorem 1], where in our case, we use the distortion as defined
in (8). Since the proof steps are the same, we will not repeat them here.

The scheme works as follows: Assume some reference set $\mathcal{A}$ of WZ scalar source codes.
We divide the time axis, $i = 1, 2, \ldots, n$, into $K = n/l$ consecutive non-overlapping blocks
(assuming $l$ divides $n$), where $l$ is a parameter to be determined. At the beginning of each
block, i.e., at times $t = (k-1)l$, $k \in \{1, 2, \ldots, K\}$, we randomly choose an expert according
to the exponential weighting probability distribution:

$$\Pr\{\text{next expert} = (e, d)\} = \frac{\exp\{-\eta \Delta_{(e,d)}(x^t)\}}{\sum_{(e',d') \in \mathcal{A}} \exp\{-\eta \Delta_{(e',d')}(x^t)\}} \qquad (11)$$

where $\eta > 0$ is a parameter to be determined. Notice that for $t = 0$, we get the uniform
distribution. After choosing the expert $(e, d)$, the encoder dedicates the first $\lceil \log|\mathcal{A}| / R \rceil$
channel symbols, at the beginning of the $k$-th block, to inform the decoder of the identity
of $d$. In the remainder of the block, the encoder produces the channel symbols $z_i = e(x_i)$.
At the same time, at the decoder side, at the beginning of the block, at times
$i = (k-1)l + 1, \ldots, (k-1)l + \lceil \log|\mathcal{A}| / R \rceil$, the decoder outputs arbitrary symbols from $\mathcal{X}$. In the
rest of the block, knowing $d$, it reproduces $\hat{x}_i = d(z_i, y_i)$.

Exactly as in [3], the values of $l$ and $\eta$ are optimized to get minimal redundancy, and
are given by:

$$l = 2\{\log(|\mathcal{A}|)\, n / R^2\}^{1/3}, \qquad \eta = \{8 \log(|\mathcal{A}|) / (l B^2 n)\}^{1/2} \qquad (12)$$

Remark. Throughout this paper, we assume that $n$ is known in advance. Generalizing the
scheme to the case where the horizon is unknown is straightforward, as explained in [3].
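
The block parameters of (12) and the exponential weighting rule (11) can be sketched as follows. This is a minimal illustration under stated assumptions: natural logarithms are used, the block length is rounded up to an integer, and the experts are represented only by their cumulative distortions. None of these choices, nor the function names, come from the paper.

```python
import math
import random


def block_parameters(num_experts, n, R, B=1.0):
    """Block length l and parameter eta following (12); l is rounded up (assumption)."""
    log_a = math.log(num_experts)
    l = math.ceil(2.0 * (log_a * n / R ** 2) ** (1.0 / 3.0))
    eta = math.sqrt(8.0 * log_a / (l * B ** 2 * n))
    return l, eta


def expert_probabilities(cumulative_distortions, eta):
    """Exponential weighting distribution (11) over the experts."""
    shift = min(cumulative_distortions)  # a common shift cancels in the ratio (11)
    weights = [math.exp(-eta * (d - shift)) for d in cumulative_distortions]
    total = sum(weights)
    return [w / total for w in weights]


def draw_expert(cumulative_distortions, eta, rng=random):
    """Draw the index of the next expert according to (11)."""
    probs = expert_probabilities(cumulative_distortions, eta)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


l, eta = block_parameters(num_experts=16, n=10_000, R=1.0)
uniform = expert_probabilities([0.0, 0.0, 0.0], eta)  # no past data: uniform choice
skewed = expert_probabilities([1.0, 5.0], eta)        # lower distortion is favoured
```

Evaluating (11) this way costs time proportional to the number of experts per block, which is what the efficient implementations of Section 3 are designed to avoid.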

3 Efficient implementation for sets of scalar source codes 

In this section, we present an efficient implementation of the scheme described in Section
2, for sets of scalar source codes. Each one of these sets of source codes consists of all pairs
$(e, d)$, where $e$ is one of the encoders in some small set of encoders, and $d$ is one of its
possible decoders, as defined in Subsection 2.1. By "small set", we mean that the random
choice of the encoder can be done directly (as will be explained below). This definition
depends, of course, on the computational resources we allocate. Recall that given a
specific encoder, the decoder, for each $(z, y)$, chooses some $\hat{x}$ from the subset of source
letters $m_z$. Thus, for each pair $(z, y)$ there are $|m_z|$ possible $\hat{x}$'s. Hence, given an encoder,
the number of possible decoders is:

$$\prod_{y \in \mathcal{X}} \prod_{z=1}^{M} |m_z| = (|m_1||m_2| \cdots |m_M|)^{|\mathcal{X}|} \ge 2^{|\mathcal{X}|} \qquad (13)$$

where $|m_z|$ is the cardinality of the subset of letters $m_z$. The lower bound is derived from
the fact that in the lossy encoding case $M < |\mathcal{X}|$, so the product $|m_1||m_2| \cdots |m_M|$ is at least 2.
Thus, given a set of encoders, the number of possible WZ source codes is at least $|E| 2^{|\mathcal{X}|}$,
where $|E|$ is the number of encoders. Given a set of experts $\mathcal{A}$, we follow the scheme of
Subsection 2.2. We divide the time axis, $i = 1, 2, \ldots, n$, into $K = n/l$ consecutive
non-overlapping blocks. We randomly choose the next expert at the beginning of each block
according to the exponential weighting probability distribution. The distortion of an expert
$(e, d)$ at time $t$ is given by:

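$$\Delta_{(e,d)}(x^t) = \sum_{i=1}^{t} \Delta_i(x_i)$$
$$= \sum_{i=1}^{t} \sum_{y} P_{Y|X}(y|x_i)\, \rho(x_i, \hat{x}(x_i, y))$$
$$= \sum_{i=1}^{t} \sum_{y} P_{Y|X}(y|x_i)\, I_{(x_i,y) \in A}$$
$$= \sum_{x,y \in \mathcal{X}} n_t(x) P_{Y|X}(y|x)\, I_{(x,y) \in A} \qquad (14)$$

where $A$ is the set of all pairs $(x, y)$ which contribute to the distortion, i.e., $d(e(x), y) \ne x$,
$I_B$ is the indicator function for an event $B$, and $n_t(x)$ is the number of times $x$ appeared in
$x^t$. For a more convenient form of (11), we multiply the numerator and denominator by
$\exp\{\eta \sum_{x,y \in \mathcal{X}} n_t(x) P_{Y|X}(y|x)\}$ and we get:

$$\Pr\{\text{next expert} = (e, d)\} = \frac{\lambda_{(e,d),t}}{\sum_{(e',d')} \lambda_{(e',d'),t}} \qquad (15)$$

where:

$$\lambda_{(e,d),t} = \exp\Big\{\eta \sum_{x,y \in \mathcal{X}} n_t(x) P_{Y|X}(y|x)\, I_{(x,y) \in \bar{A}}\Big\} \qquad (16)$$

where $\bar{A}$ is the complementary set of $A$, i.e. all pairs $(x, y)$ such that $d(e(x), y) = x$. Given
a set of experts, the random choice of an expert at the beginning of each block is done in
two steps. First, we choose an encoder randomly according to:

$$\Pr\{\text{next encoder} = e\} = \frac{F_{e,t}}{\sum_{e' \in E} F_{e',t}} \qquad (17)$$

where $E$ is the set of encoders, and:

$$F_{e,t} = \sum_{(e,d) \in \mathcal{A}_e} \lambda_{(e,d),t} \qquad (18)$$

is the sum of the exponential weights of all experts in $\mathcal{A}_e$, where $\mathcal{A}_e$ is the subset of all
experts which use the encoder $e$. $F_{e,t}$ can be calculated efficiently in the following way: For
each pair $(x, y)$ calculate $\lambda_{x,y,t}$ where:

$$\lambda_{x,y,t} = \exp\{\eta\, n_t(x) P_{Y|X}(y|x)\} \qquad (19)$$

and then for each $(z, y)$, calculate the sum $\sum_{x: e(x)=z} \lambda_{x,y,t}$ where $e(x)$ is the encoding of $x$.

Lemma 1.1 The product of all these sums is $F_{e,t}$:

$$F_{e,t} = \prod_{z=1}^{M} \prod_{y \in \mathcal{X}} \Big( \sum_{x: e(x)=z} \lambda_{x,y,t} \Big) \qquad (20)$$

Proof.

$$\prod_{z=1}^{M} \prod_{y \in \mathcal{X}} \Big( \sum_{x: e(x)=z} \lambda_{x,y,t} \Big)$$
$$= \prod_{y \in \mathcal{X}} \prod_{z=1}^{M} \Big( \sum_{x: e(x)=z} \lambda_{x,y,t} \Big)$$
$$= \prod_{y \in \mathcal{X}} \sum_{\mathbf{x} \in m_1 \times m_2 \times \cdots \times m_M} \prod_{i=1}^{M} \lambda_{\mathbf{x}(i),y,t}$$
$$= \sum_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathcal{X}|} \in m_1 \times m_2 \times \cdots \times m_M} \prod_{j=1}^{|\mathcal{X}|} \prod_{i=1}^{M} \lambda_{\mathbf{x}_j(i),y_j,t}$$
$$= \sum_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathcal{X}|} \in m_1 \times m_2 \times \cdots \times m_M} \exp\Big\{\eta \sum_{j=1}^{|\mathcal{X}|} \sum_{i=1}^{M} n_t(\mathbf{x}_j(i)) P_{Y|X}(y_j|\mathbf{x}_j(i))\Big\}$$
$$= \sum_{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathcal{X}|} \in m_1 \times m_2 \times \cdots \times m_M} \exp\Big\{\eta \sum_{i,j=1}^{|\mathcal{X}|} n_t(x_i) P_{Y|X}(y_j|x_i) \cdot I_{x_i \in (\mathbf{x}_j(1), \mathbf{x}_j(2), \ldots, \mathbf{x}_j(M))}\Big\}$$
$$= \sum_{(e,d) \in \mathcal{A}_e} \lambda_{(e,d),t} \qquad (21)$$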

In the second line, we change the order of the products, first over all $z$'s for a given $y$ and then
over all the $y$'s. In the third line, we calculate the product over $z$. $\mathbf{x} = (\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(M))$
is a vector of length $M$, where $\mathbf{x}(1) \in m_1$ (i.e. $e(\mathbf{x}(1)) = 1$), $\mathbf{x}(2) \in m_2$, etc. The sum is over
all such vectors. In other words, when expanding the product over $z$, we obtain the sum
of all combinations of multiplying $M$ $\lambda_{x,y,t}$'s where the $x$ of each $\lambda$ belongs to a different
subset of letters (according to the encoder). In the fourth line, we expand the product over
$y$. Now, we obtain the sum of all combinations of multiplying $|\mathcal{X}|$ terms, where each product
depends on some vector $\mathbf{x}$ as defined before, and on a different $y$. The rest is obtained by
simply substituting the expression for $\lambda_{x,y,t}$. Choosing a decoder for a given encoder $e$ is
actually choosing $|\mathcal{X}|$ vectors of length $M$, one vector for each $y$. The vector of each $y$
contains the decoding for each $z$ given that $y$, as explained for the third line.
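
Lemma 1.1 can be checked numerically on a toy instance by comparing the product form (20) with a brute-force sum of the expert weights (16) over all admissible decoders. The alphabet, symbol counts, channel and value of $\eta$ below are hypothetical, and the function names are assumptions of this sketch.

```python
import math
from itertools import product


def lam(x, y, counts, channel, eta):
    """Per-pair weight (19): exp(eta * n_t(x) * P(y|x))."""
    return math.exp(eta * counts[x] * channel[x][y])


def F_product(encoder, alphabet, M, counts, channel, eta):
    """Right-hand side of (20): product over (z, y) of sums over the subset m_z."""
    result = 1.0
    for z in range(1, M + 1):
        for y in alphabet:
            result *= sum(lam(x, y, counts, channel, eta)
                          for x in alphabet if encoder[x] == z)
    return result


def F_bruteforce(encoder, alphabet, M, counts, channel, eta):
    """Sum of the expert weights (16) over every admissible decoder of `encoder`."""
    cells = {z: [x for x in alphabet if encoder[x] == z] for z in range(1, M + 1)}
    keys = [(z, y) for z in range(1, M + 1) for y in alphabet]
    total = 0.0
    for choice in product(*(cells[z] for z, _ in keys)):
        d = dict(zip(keys, choice))
        # Only pairs (x, y) that are reconstructed correctly enter the exponent.
        exponent = sum(counts[x] * channel[x][y]
                       for x in alphabet for y in alphabet
                       if d[(encoder[x], y)] == x)
        total += math.exp(eta * exponent)
    return total


alphabet = [1, 2, 3, 4]
encoder = {1: 1, 2: 1, 3: 2, 4: 2}
counts = {1: 3, 2: 1, 3: 0, 4: 2}  # hypothetical n_t(x)
channel = {x: {y: (0.7 if x == y else 0.1) for y in alphabet} for x in alphabet}
fast = F_product(encoder, alphabet, 2, counts, channel, eta=0.1)
slow = F_bruteforce(encoder, alphabet, 2, counts, channel, eta=0.1)
```

The brute-force version visits (2 * 2)^4 = 256 decoders here, while the product form needs only one short sum per pair (z, y).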
In the second step, we choose the decoder randomly according to:

$$\Pr\{\text{decoder} = d \mid \text{encoder} = e\} = \frac{\lambda_{(e,d),t}}{F_{e,t}} \qquad (22)$$

The random choice of the decoder can be implemented efficiently in the following way: For
each pair $(z, y)$, choose the decoder output $d(z, y)$ randomly, according to the probability
distribution:

$$\Pr\{d(z, y) = x\} = \frac{\lambda_{x,y,t}}{\sum_{x': e(x')=z} \lambda_{x',y,t}} \qquad (23)$$

where $x \in \{x : e(x) = z\}$. Choosing the decoder function in this way, we get that:

$$\Pr\{\text{decoder} = d \mid \text{encoder} = e\} = \prod_{y \in \mathcal{X}} \prod_{z=1}^{M} \Pr\{d(z, y) = x\} = \frac{\prod_{y \in \mathcal{X}} \prod_{z=1}^{M} \lambda_{d(z,y),y,t}}{\prod_{y \in \mathcal{X}} \prod_{z=1}^{M} \sum_{x': e(x')=z} \lambda_{x',y,t}} \qquad (24)$$

The numerator and denominator were already proved to be given by $\lambda_{(e,d),t}$ and $F_{e,t}$,
respectively, in (21). Therefore, the decoder is indeed chosen according to (22). We have thus
demonstrated an efficient random selection of a pair $(e, d)$. Below is a formal description of
the on-line algorithm:



1. Calculate $l$, the optimal length of a data block, according to (12), and let $K = n/l$.

2. Initialize $k$ to 0, and all the weights $\lambda_{x,y,0}$ to 1.

3. At the beginning of block no. $k$, update the weights in the following way:

$$\lambda_{x,y,t_k} = \lambda_{x,y,t_{k-1}} \exp\Big\{\eta \sum_{i=(k-1)l+1}^{kl} I_{x_i = x} P_{Y|X}(y|x)\Big\}, \qquad t_k = kl + 1, \quad 1 \le k \le K - 1$$

4. For each $(e, z, y)$, calculate the sums $\sum_{x: e(x)=z} \lambda_{x,y,t_k}$.

5. Calculate $F_{e,t_k}$ for each $e \in E$, according to (20).

6. Choose an encoder $e_k$ randomly according to (17).

7. For each pair $(z, y)$, choose the decoder function $d_k$ randomly according to (23).

8. Use the first $\lceil \log(N) / R \rceil$ channel symbols at the beginning of the $k$th block to inform
the decoder of the identity of $d_k$, chosen in the previous step, where $N$ is the number of
experts.

9. Encode the next block using the chosen encoder $e_k$:
$z_i = e_k(x_i)$, $\quad kl + \lceil \log(N)/R \rceil + 1 \le i \le (k+1)l - 1$

10. If $k < K$, increment $k$ and go to 3.

The total complexity of the algorithm is $O(n/l \cdot |\mathcal{X}|^2) + O(n/l \cdot |E||\mathcal{X}|M) + O(n)$. The
complexity depends on $|E|$, which thus should be small, as was mentioned above.
The computational complexity of the algorithm is as follows: The calculations of
$\sum_{i=(k-1)l+1}^{kl} I_{x_i = x}$ for each $x \in \mathcal{X}$ at each time take $O(n)$ computations in total. After
calculating these quantities, we can update the $\lambda_{x,y,t}$'s as described in step 3 of the algorithm
above. This takes $O(|\mathcal{X}|^2)$. Calculating the sums in step 4, given the $\lambda_{x,y,t}$'s, takes $O(|\mathcal{X}|^2)$.
Calculating the $F_{e,t}$'s takes $O(|E||\mathcal{X}|M)$ calculations, where $|E|$ is the cardinality of the set
of encoders.
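
The two-step random selection at the heart of the algorithm above (steps 4 to 7) can be sketched as follows. The encoders, counts and channel are hypothetical, `rng.choices` from the Python standard library stands in for the random draws, and the function names are assumptions of this sketch rather than part of the paper.

```python
import math
import random


def pair_weights(alphabet, counts, channel, eta):
    """All per-pair weights (19), indexed by (x, y)."""
    return {(x, y): math.exp(eta * counts[x] * channel[x][y])
            for x in alphabet for y in alphabet}


def encoder_weight(encoder, alphabet, M, lam):
    """F_{e,t} computed through the product form (20)."""
    F = 1.0
    for z in range(1, M + 1):
        cell = [x for x in alphabet if encoder[x] == z]
        for y in alphabet:
            F *= sum(lam[(x, y)] for x in cell)
    return F


def draw_expert(encoders, alphabet, M, lam, rng):
    """Two-step draw: the encoder by (17), then each d(z, y) independently by (23)."""
    F = [encoder_weight(e, alphabet, M, lam) for e in encoders]
    e = rng.choices(encoders, weights=F, k=1)[0]
    d = {}
    for z in range(1, M + 1):
        cell = [x for x in alphabet if e[x] == z]
        for y in alphabet:
            d[(z, y)] = rng.choices(cell, weights=[lam[(x, y)] for x in cell], k=1)[0]
    return e, d


alphabet = [1, 2, 3]
encoders = [{1: 1, 2: 2, 3: 2}, {1: 1, 2: 1, 3: 2}]  # a "small set" of encoders
counts = {1: 4, 2: 1, 3: 2}                          # hypothetical n_t(x)
channel = {x: {y: (0.8 if x == y else 0.1) for y in alphabet} for x in alphabet}
lam = pair_weights(alphabet, counts, channel, eta=0.2)
e, d = draw_expert(encoders, alphabet, 2, lam, random.Random(0))
```

The work per block grows with the number of encoders but not with the number of decoders, in line with the complexity discussion above.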

3.1 Large set of encoders with structure 

As was shown, we can choose a pair $(e, d)$ randomly, in two steps. In the first step, we choose
the encoder according to (17). In the second step, we choose randomly one of its possible
decoders according to (22). In the previous part, we assumed that the set of encoders is
small, so we can implement (17) directly, i.e., calculate $F_{e,t}$ for each encoder separately. In
this part, we use a large structured set of encoders. Using the structure, we can efficiently
implement (17). We assume that the input alphabet $\mathcal{X}$ is ordered. We enumerate the source
symbols according to that order. By $Num(x)$, $1 \le Num(x) \le |\mathcal{X}|$, we denote the location
of the symbol $x$ in that order.

3.1.1 Definition of the set of encoders 

The Input Alphabet Axis (IAA) is defined as the $|\mathcal{X}|$-dimensional vector $(1, 2, \ldots, |\mathcal{X}|)$. A
partition of the IAA is given by the $(M-1)$-dimensional sequence $r = (z_1, \ldots, z_{M-1})$, $z_i \in
\{1, 2, \ldots, |\mathcal{X}| - 1\}$, $0 = z_0 < z_1 < \ldots < z_M = |\mathcal{X}|$. Each partition $r$ represents a specific
encoder in the following way:

$$e(x) = i : z_{i-1} < Num(x) \le z_i, \quad i \in \{1, \ldots, M\} \qquad (25)$$

We define $E$ as the set of all such encoders. The cardinality of the set of encoders is
$\binom{|\mathcal{X}| - 1}{M - 1}$.
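
To make the structured set concrete, the sketch below enumerates the interval encoders of (25) for a small alphabet. Each encoder corresponds to one way of picking $M - 1$ cut points among the $|\mathcal{X}| - 1$ interior positions of the IAA. The function name and the list representation are assumptions made for this example.

```python
from itertools import combinations
from math import comb


def interval_encoders(alphabet_size, M):
    """Yield each encoder of (25) as a list whose entry Num(x) - 1 is the index e(x)."""
    for cuts in combinations(range(1, alphabet_size), M - 1):
        bounds = (0,) + cuts + (alphabet_size,)
        encoder = []
        for i in range(1, M + 1):
            # Letters with bounds[i - 1] < Num(x) <= bounds[i] are mapped to i.
            encoder += [i] * (bounds[i] - bounds[i - 1])
        yield encoder


encoders = list(interval_encoders(alphabet_size=5, M=3))
expected_count = comb(5 - 1, 3 - 1)  # choose M - 1 cut points out of |X| - 1
```

Already for moderate alphabet sizes this set is too large to handle one encoder at a time, which motivates the graph representation that follows.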

3.1.2 Graphical representation of the set of encoders 

The random choice of the encoders can be done efficiently using an acyclic directed graph
(see Fig. 1). We use the following notation:
$V$ - The set of all vertices:
$\{1, 2, \ldots, |\mathcal{X}| - 1\} \times \{1, 2, \ldots, M - 1\} \cup (0, 0) \cup (|\mathcal{X}|, M)$
$\mathcal{E}$ - The set of all edges:
$\{((z, j-1), (\hat{z}, j)) : z, \hat{z} \in \{0, 1, 2, \ldots, |\mathcal{X}|\}, j \in \{1, 2, \ldots, M\}, \hat{z} > z\}$
$s$ - The starting point in the bottom left, i.e. $(0, 0)$
$u$ - The end point in the top right, i.e. $(|\mathcal{X}|, M)$
$\mathcal{E}_z$ - The set of all edges starting from vertex $z$.

A general graph is described in Fig. 1. The horizontal axis represents the ordered input
alphabet. The vertical axis represents the $M - 1$ choices needed for dividing the IAA into
$M$ segments. A path composed of the edges $\{(0, 0), (z_1, 1), \ldots, (z_{M-1}, M-1), (|\mathcal{X}|, M)\}$
represents $M - 1$ consecutive choices of $M - 1$ points $(z_1, \ldots, z_{M-1})$ which divide the IAA into
$M$ segments, creating $M$ subsets of the input alphabet. Each edge on a path represents
one choice, the choice of the next point on the horizontal axis, which defines the next
segment. An edge $((z, j-1), (\hat{z}, j))$ corresponds to the segment $(z, \hat{z}]$ on the horizontal axis,
and is thus equivalent to the subset $\{x : z < Num(x) \le \hat{z}\}$. There are $O(M|\mathcal{X}|^2)$ edges. For each edge
$a \in \mathcal{E}$ and time $t$ we assign a weight $\delta_{a,t}$:

$$\delta_{a,t} = \prod_{y \in \mathcal{X}} \sum_{x \in (z, \hat{z}]} \lambda_{x,y,t}, \qquad a = ((z, j-1), (\hat{z}, j)) \qquad (26)$$

where $\lambda_{x,y,t}$ is given by (19). It can be seen from (26) that a weight $\delta_{a,t}$ depends only on
the horizontal coordinates of the edge $a$, thus we can denote it as $\delta_{(z,\hat{z}),t}$. The cumulative
weight of a path $r = \{(0, 0), (z_1, 1), \ldots, (z_{M-1}, M-1), (|\mathcal{X}|, M)\}$ at time $t$ is defined as the
product of its edge weights:

$$\Lambda_{r,t} = \prod_{a \in r} \delta_{a,t} \qquad (27)$$

Figure 1: The graph representing all possible partitions of the input alphabet into $M$
subsets given the alphabet is ordered in some specific order. For example, the left dashed
arrow defines the subset $\{1, 2, 3\}$, the middle and right dashed arrows define the subsets
$\{4\}$ and $\{|\mathcal{X}| - 2, |\mathcal{X}| - 1, |\mathcal{X}|\}$ respectively.

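$\Lambda_{r,t}$ is simply $F_{e,t}$:

$$\Lambda_{r,t} = \prod_{a \in r} \delta_{a,t} = \prod_{m=1}^{M} \prod_{y \in \mathcal{X}} \sum_{x \in (z_{m-1}, z_m]} \lambda_{x,y,t} = F_{e,t} \qquad (28)$$

where the last equality was proved in (21). From now on, our WPA description is general,
and thus will be used for all acyclic directed graphs in this article. Following the WPA, also
used in [4] and [5], we define:

$$G_t(z) = \sum_{r \in \mathcal{R}_z} \prod_{a \in r} \delta_{a,t} \qquad (29)$$

where now, $z$ is a vertex on the graph (and not only a coordinate), $\mathcal{R}_z$ is the set of all paths
from $z$ to $u$ and $a$ is an edge on the path $r$.
We see that:

$$G_t(s) = \sum_{r \in \mathcal{R}_s} \Lambda_{r,t} = \sum_{e \in E} F_{e,t} \qquad (30)$$

where $E$, the set of encoders, is of course equivalent to $\mathcal{R}_s$, the set of all paths from $s$ to $u$.
The function $G_t(z)$ can be computed recursively:

$$G_t(u) = 1, \qquad G_t(z) = \sum_{\hat{z} : (z, \hat{z}) \in \mathcal{E}_z} \delta_{(z,\hat{z}),t}\, G_t(\hat{z}) \qquad (31)$$

Because each edge is taken exactly once, calculating $G_t(z)$ for all $z$'s requires $O(|\mathcal{E}|)$
computations given the weights $\delta_{a,t}$. The function $G_t(z)$ offers an efficient way to choose an
encoder randomly according to the probability distribution in (17).
We define for each $\hat{z} \in \mathcal{E}_z$:

$$P_t(\hat{z}|z) = \delta_{(z,\hat{z}),t}\, G_t(\hat{z}) / G_t(z) \qquad (32)$$

It is easy to see that $P_t(\hat{z}|z)$ is indeed a probability distribution, i.e.,
$\sum_{\hat{z} : (z,\hat{z}) \in \mathcal{E}_z} P_t(\hat{z}|z) = 1$.

We also have:

$$\prod_{m=1}^{M} P_t(z_{m,r}|z_{m-1,r}) = \prod_{m=1}^{M} \frac{\delta_{(z_{m-1,r}, z_{m,r}),t}\, G_t(z_{m,r})}{G_t(z_{m-1,r})} = \frac{G_t(u)}{G_t(s)} \prod_{m=1}^{M} \delta_{(z_{m-1,r}, z_{m,r}),t} = \frac{\Lambda_{r,t}}{G_t(s)} = \Pr\{\text{next encoder} = r\} \qquad (33)$$

and we get exactly the probability in (17).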

Therefore, the encoder can be chosen randomly in the following sequential manner: Starting
from $z_0 = s$, at each step $m = 1, 2, \ldots, M - 1$, choose the next vertex $z_m \in \mathcal{E}_{z_{m-1}}$ with
probability $P_t(z_m|z_{m-1})$. The procedure stops when $z_m = u$.
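
The backward recursion (31) and the sequential draw based on (32) can be sketched as follows. This is a minimal illustration under stated assumptions: letters are numbered 1 to `size`, the per-pair weights are hypothetical positive numbers, and vertices from which $u$ cannot be reached simply receive $G = 0$. The function names are not from the paper.

```python
import math
import random
from itertools import combinations


def edge_weight(z, z_hat, lam, size):
    """Weight (26) of the segment (z, z_hat]; letters are numbered 1..size."""
    w = 1.0
    for y in range(1, size + 1):
        w *= sum(lam[(x, y)] for x in range(z + 1, z_hat + 1))
    return w


def backward_sums(size, M, lam):
    """G(v) of (31) for every vertex v = (z, j), filled in from the top level down."""
    G = {(size, M): 1.0}
    for j in range(M - 1, -1, -1):
        for z in ([0] if j == 0 else range(1, size)):
            G[(z, j)] = sum(edge_weight(z, zh, lam, size) * G[(zh, j + 1)]
                            for zh in range(z + 1, size + 1) if (zh, j + 1) in G)
    return G


def draw_cut_points(size, M, lam, G, rng):
    """Walk from s = (0, 0) to u = (size, M), choosing each step as in (32)."""
    z, cuts = 0, []
    for j in range(M):
        nxt = [zh for zh in range(z + 1, size + 1) if (zh, j + 1) in G]
        # rng.choices normalises the weights, so dividing by G(z) is not needed.
        w = [edge_weight(z, zh, lam, size) * G[(zh, j + 1)] for zh in nxt]
        z = rng.choices(nxt, weights=w, k=1)[0]
        cuts.append(z)
    return cuts


def brute_force_total(size, M, lam):
    """Sum of the path weights (27) over all paths, for checking G(s) as in (30)."""
    total = 0.0
    for c in combinations(range(1, size), M - 1):
        b = (0,) + c + (size,)
        total += math.prod(edge_weight(b[i], b[i + 1], lam, size) for i in range(M))
    return total


size, M = 5, 3
lam = {(x, y): math.exp(0.1 * x * (0.7 if x == y else 0.075))
       for x in range(1, size + 1) for y in range(1, size + 1)}
G = backward_sums(size, M, lam)
cuts = draw_cut_points(size, M, lam, G, random.Random(0))
```

The returned list holds $(z_1, \ldots, z_M)$ with $z_M = |\mathcal{X}|$. The sketch recomputes edge weights on demand for brevity, whereas an implementation aiming at the stated complexity would tabulate them once per block.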

Formal description of the on-line algorithm: Using the set of encoders described above, we
now have the following algorithm:

1. Calculate $l$, the optimal length of a data block, according to (12), and let $K = n/l$.

2. Initialize $k$ to 0, and all the weights $\lambda_{x,y,0}$ to 1.

3. Build the encoders graph as described in this section.

4. Initialize all the weights $\delta_{a,0}$ to 1.

5. At the beginning of block no. $k$, i.e. at time $t_k = kl + 1$, $k \in \{1, \ldots, K\}$, update the
weights in the following way:

$$\lambda_{x,y,t_k} = \lambda_{x,y,t_{k-1}} \exp\Big(\eta \sum_{i=(k-1)l+1}^{kl} I_{x_i = x} P_{Y|X}(y|x)\Big)$$

6. At the beginning of block no. $k$, calculate $\delta_{(z,\hat{z}),t_k}$ for each pair $(z, \hat{z})$ according to (26).

7. Update the weights of all edges to the new $\delta_{(z,\hat{z}),t_k}$'s.

8. Calculate $G_{t_k}(z)$ recursively, for all $z$, according to (31).

9. Choose the encoder $e_k$ randomly as described above, using (32).

10. For each pair $(z, y)$, choose the decoder function $d_k$ randomly according to (23).

11. Use the first $\lceil \log(N) / R \rceil$ channel symbols at the beginning of the $k$th block to inform
the decoder of the identity of $d_k$, chosen in the previous step, where $N$ is the number of
experts.

12. Encode the next block, using the chosen encoder $e_k$:
$z_i = e_k(x_i)$, $\quad kl + \lceil \log(N)/R \rceil + 1 \le i \le (k+1)l - 1$

13. If $k < K$, increment $k$ and go to 5.

The total complexity of the algorithm is $O(n/l \cdot |\mathcal{X}|^3) + O(n/l \cdot M|\mathcal{X}|^2) + O(n/l \cdot |\mathcal{X}|^2) + O(n)$.

4 Definition of an On-line Adaptive Variable-Rate Coding 
Scheme 

In this section, we generalize the results of Theorem 1 to the variable-rate coding case. This is done by generalizing the performance criterion to include the compression, in addition to the cumulative distortion of the code. The scheme we use is similar to that of Theorem 1. The use of variable-rate codes complicates the problem: a choice of an expert is actually a combination of two choices. We now have to choose, simultaneously, a lossy code and a lossless variable-rate code (as will be explained). The challenge is to describe the reference set in a way that allows an efficient implementation. We start by defining a variable-rate code. Without loss of generality, we assume that the compressed sequence is binary. We define $C_M$ as a binary prefix code, which contains $M$ codewords $\{b_1, b_2, \ldots, b_M\}$, where each $b_i$ is a binary string of length $l(b_i)$. We call the ordered set $\{l(b_1), l(b_2), \ldots, l(b_M)\}$ the length set of the code. Since we deal with prefix codes, a length set must, of course, satisfy the Kraft inequality, i.e., $\sum_{i=1}^{M} 2^{-l(b_i)} \le 1$. A variable-rate source code is defined in the following way: given a source sequence $x^n$, the operation of the encoder can be described as being composed of two steps, the first lossy and the second lossless. First, the encoder transforms $x^n$ into a sequence $z^n$ whose symbols $\{z_i\}$ take on values in the set $\{1, 2, \ldots, M\}$. If $M < \log|\mathcal{X}|$, this step is of course lossy. Then, the encoder uses a code $C_M$ to encode $z^n$ into $b^n$, a sequence of variable-length binary strings, by encoding each $z_i$ into a codeword $b_i$, where $b_i$ takes on values in the set $\{b_1, b_2, \ldots, b_M\}$. This step is lossless. The decoder, knowing $b^n$, produces $z^n$ without error. Then, based on $z^n$ and the side information sequence $y^n$, the decoder produces the reconstructed sequence $\hat{x}^n$. Throughout the rest of the paper, we omit the intermediate sequence $z^n$ and the lossless part of the decoding. We define the Lagrangian Cost (LC) for the input symbol at time $t$, $x_t$, as:

$$C(x_t) = \Delta(x_t) + \delta\, l(b_t) \tag{34}$$

where $\Delta(x_t) \in [0, B]$ is a bounded distortion measure, $l(b_t)$ is the length of the binary codeword at time $t$ and $\delta$ is a positive constant. If the $C_M$'s are Huffman codes, $l(b_t)$ is bounded by $M - 1$, the maximal depth of a complete binary tree with $M$ leaves.

4.1 Definition of the reference set of source codes

In this part, we define the general set of variable-rate scalar source codes, i.e., our set of experts. Each expert is a source code with $M$ binary codewords, which form some binary prefix code $C_M$. Each expert partitions $\mathcal{X}$ into $M$ disjoint subsets $(m_1, m_2, \ldots, m_M)$, where each subset $m_i$ is encoded as a binary codeword $b_i$, $i \in \{1, 2, \ldots, M\}$. The variable-rate encoder $e$ for each expert is given by a function $e : \mathcal{X} \to \{b_1, b_2, \ldots, b_M\}$, that is, $b_i = e(x_i)$. The decoder $d$ receives $b_i$ and, together with the side information $y_i$ if available, decides on $\hat{x}_i$, using a decoding function $d : \{b_1, b_2, \ldots, b_M\} \times \mathcal{X} \to \mathcal{X}$, i.e., $\hat{x}_i = d(b_i, y_i)$. The set of decoders is defined as in Subsection 2.1, with only one difference: instead of getting an index $z_i$, the decoder gets a binary codeword which represents this index. Again, we limit our discussion to decoders which satisfy:

$$d(e(x), y) = X, \quad \text{for all } x \text{ and } y \tag{35}$$

To complete the definition, as was explained in Subsection 2.1, all the encoders which have the same functionality are treated as the same encoder. In this part, by the same functionality we mean that encoders which implement the same partition of the input alphabet, as defined in Subsection 2.1, and in addition have the same length set, are treated as the same encoder.

We define $\mathcal{Q}(M)$ as the set of all variable-rate scalar source codes, i.e., all the pairs of variable-rate scalar encoders and one of their possible decoders, as defined in this section.

4.2 An on-line variable-rate coding scheme

In this part, we describe an on-line adaptive variable-rate coding scheme based on the results of [3]. We actually extend Theorem 1 from the case of a pure distortion criterion to the LC case. For any source sequence $x^n$, the LC $\mathcal{L}_{(e,d)}(x^n)$ of a source code $(e, d)$ is defined by:

$$\mathcal{L}_{(e,d)}(x^n) = \sum_{t=1}^{n} C(x_t) \tag{36}$$

For the WZ case of scalar source code we get:

$$\mathcal{L}_{(e,d)}(x^n) = \sum_{t=1}^{n} \sum_{y \in \mathcal{X}} P_{Y|X}(y|x_t)\,\rho(x_t, d(e(x_t), y)) + \delta \sum_{t=1}^{n} l(b_t). \tag{37}$$
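
As an illustration of (37), the following Python sketch evaluates the expected LC of a small scalar code with side information. The three-letter alphabet, the channel, the partition, the decoder table and the function names are all hypothetical choices made for the example, and Hamming distortion is assumed for $\rho$.

```python
def kraft_sum(lengths):
    """Kraft sum of a length set; at most 1 for any binary prefix code."""
    return sum(2.0 ** -l for l in lengths)

def lagrangian_cost(xs, encoder, decoder, lengths, channel, rho, delta):
    """Expected LC: sum over t of E_y[rho(x_t, d(e(x_t), y))] + delta * l(b_t)."""
    total = 0.0
    for x in xs:
        z = encoder[x]  # index of the cell (and of the codeword) of x
        expected_distortion = sum(
            p * rho(x, decoder[(z, y)]) for y, p in channel[x].items()
        )
        total += expected_distortion + delta * lengths[z]
    return total

# Hypothetical setting: alphabet {0, 1, 2}, M = 2 cells, codeword lengths (1, 1).
alphabet = [0, 1, 2]
encoder = {0: 0, 1: 0, 2: 1}
lengths = {0: 1, 1: 1}
# Assumed side-information channel: y equals x with probability 0.8.
channel = {x: {y: (0.8 if y == x else 0.1) for y in alphabet} for x in alphabet}
# The decoder outputs a symbol of the indicated cell, guided by y.
decoder = {(0, 0): 0, (0, 1): 1, (0, 2): 0, (1, 0): 2, (1, 1): 2, (1, 2): 2}
hamming = lambda x, x_hat: 0.0 if x == x_hat else 1.0

cost = lagrangian_cost([0, 1, 2, 1], encoder, decoder, lengths, channel,
                       hamming, delta=0.5)
```

For the sequence (0, 1, 2, 1) the expected distortion is 0.1 + 0.2 + 0 + 0.2 and the length term is 0.5 times 4 bits, so `cost` comes out at 2.5.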

Given any finite set of variable-rate scalar source codes, our coding scheme asymptotically achieves the LC, $\mathcal{L}_{(e,d)}(x^n)$, of the best source code in the set, for all source sequences $x^n$. To be more specific, it can be shown, in a similar way as in [3], that for any bounded distortion measure and some positive $\delta$, the following result holds:

Theorem 2 Let $\mathcal{A}$ be a finite subset of $\mathcal{Q}(M)$. Then there exists a sequential source code $(e, d)$ such that for all $x^n \in \mathcal{X}^n$:

$$E\Big\{\frac{1}{n}\Big[\mathcal{L}_{(e,d)}(x^n) - \min_{(e',d')\in\mathcal{A}} \mathcal{L}_{(e',d')}(x^n)\Big]\Big\} \le C_1 [\log|\mathcal{A}|]^{2/3} n^{-1/3} \tag{38}$$

where $C_1$ is a constant that depends only on $B$, $M$ and $\delta$.

The proof is similar to the proof of Theorem 1. Nonetheless, we give the full proof for 
completeness because there are some differences between the case of fixed-rate and the 
variable-rate case. 
Proof of Theorem 2 : 

The scheme works similarly to the scheme in the fixed-rate case. Assume that we have some reference set $\mathcal{A}$ of variable-rate WZ scalar source codes. We divide the time axis, $t = 1, 2, \ldots, n$, into $K = n/l$ consecutive non-overlapping blocks (assuming $l$ divides $n$). At the beginning of each block, i.e., at time $t = (k-1)l$, $k \in \{1, 2, \ldots, K\}$, we randomly choose an expert according to the exponential weighting probability distribution:

$$\Pr\{\text{next expert} = (e, d)\} = \frac{\exp\{-\eta \mathcal{L}_{(e,d)}(x^t)\}}{\sum_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^t)\}} \tag{39}$$

The parameters $l$ and $\eta > 0$ will be optimized later. After choosing the expert $(e, d)$, the encoder dedicates the first $\lceil \log|\mathcal{A}| \rceil$ bits, at the beginning of the $k$-th block, to inform the decoder of the identity of $d$. In the remainder of the block, the encoder produces the binary strings $b_i = e(x_i)$. At the same time, at the decoder side, while receiving the first $\lceil \log|\mathcal{A}| \rceil$ bits of the block, the decoder outputs arbitrary symbols. In the rest of the block, knowing $d$, it reproduces $\hat{x}_i = d(b_i, y_i)$. Define for each $k$:

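$$W_k = \sum_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^{(k-1)l})\} \tag{40}$$

As in [3], we then have for $N = n/l$:

$$\log\frac{W_{N+1}}{W_1} = \log \sum_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^n)\} - \log|\mathcal{A}| \ge \log \max_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^n)\} - \log|\mathcal{A}| = -\eta \min_{(e',d')\in\mathcal{A}} \mathcal{L}_{(e',d')}(x^n) - \log|\mathcal{A}| \tag{41}$$

On the other hand, for each $1 \le k \le N$:

$$\log\frac{W_{k+1}}{W_k} = \log \frac{\sum_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^{(k-1)l})\} \exp\{-\eta \mathcal{L}_{(e',d')}(x_{(k-1)l+1}^{kl})\}}{\sum_{(e',d')\in\mathcal{A}} \exp\{-\eta \mathcal{L}_{(e',d')}(x^{(k-1)l})\}}$$
$$= \log E_{Q_k} \exp\{-\eta \mathcal{L}_{(e',d')}(x_{(k-1)l+1}^{kl})\}$$
$$\le -\eta E_{Q_k}\big[\mathcal{L}_{(e',d')}(x_{(k-1)l+1}^{kl})\big] + \frac{\eta^2 l^2 \tilde{B}^2}{8}$$
$$\le -\eta \Big\{E\big[\mathcal{L}_{(e_k,d_k)}(x_{(k-1)l+1}^{kl})\big] - B \lceil \log|\mathcal{A}| \rceil\Big\} + \frac{\eta^2 l^2 \tilde{B}^2}{8} \tag{42}$$

where in our case, we have defined the maximum LC for a single input symbol:

$$\tilde{B} = B + \delta(M - 1) \tag{43}$$

and where $E_{Q_k}$ denotes expectation with respect to the distribution $Q_k$ on $\mathcal{A}$, which assigns a probability proportional to $\exp\{-\eta \mathcal{L}_{(e',d')}(x^{(k-1)l})\}$ to each $(e', d')$ in $\mathcal{A}$. The expectation in the last line is with respect to the random choices of the code. The first inequality follows from Hoeffding's bound (cf. [11, Lemma 8.1]). The second follows from the construction of the code described above. We use the first $\lceil \log|\mathcal{A}| \rceil$ bits of each data block to inform the decoder of the identity of the encoder. This causes a cumulative distortion which depends on the number of codewords we lose. Unlike in [3], this number is not constant because we use a variable-length code. The maximal number of codewords we can lose is $\lceil \log|\mathcal{A}| \rceil - 1$ (the worst case is when each one of the first $\lceil \log|\mathcal{A}| \rceil - 1$ codewords has a length of one bit; since the $\lceil \log|\mathcal{A}| \rceil$-th bit belongs to the next codeword, we lose it too). Therefore, the cumulative distortion caused by the loss of the first $\lceil \log|\mathcal{A}| \rceil$ bits can be no more than $B \lceil \log|\mathcal{A}| \rceil$. The cumulative distortion of the rest of the block is exactly the loss of the pair $(e, d)$ chosen at the beginning of the block. Summing over $k$, we get: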

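$$\log\frac{W_{N+1}}{W_1} \le -\eta \sum_{k=1}^{N} E\big[\mathcal{L}_{(e_k,d_k)}(x_{(k-1)l+1}^{kl})\big] + \eta N B \log|\mathcal{A}| + \frac{N \eta^2 l^2 \tilde{B}^2}{8} \tag{44}$$

Combining (41) and (44), we get:

$$\sum_{k=1}^{N} E\big[\mathcal{L}_{(e_k,d_k)}(x_{(k-1)l+1}^{kl})\big] - \min_{(e',d')\in\mathcal{A}} \mathcal{L}_{(e',d')}(x^n) \le \frac{\log|\mathcal{A}|}{\eta} + \frac{\eta l^2 \tilde{B}^2 N}{8} + B N \log|\mathcal{A}|$$
$$= \tilde{B}\sqrt{\log|\mathcal{A}|/2}\; l N^{1/2} + B N \log|\mathcal{A}| \tag{45}$$

where the equality follows upon taking the minimizing value $\eta = \sqrt{8 \log|\mathcal{A}| / (l^2 \tilde{B}^2 N)}$. For convenience, we denote $\alpha = \tilde{B}\sqrt{\log|\mathcal{A}|/2}$ and $\beta = B \log|\mathcal{A}|$, so that the last line of (45) becomes $\alpha l N^{1/2} + \beta N = \alpha n N^{-1/2} + \beta N$. Minimizing with respect to $N$, we take $N = (\alpha n / 2\beta)^{2/3}$ and get an expression upper bounded by $2(\alpha n)^{2/3} \beta^{1/3}$. Substituting the values of $\alpha$ and $\beta$, we obtain:

$$E\Big\{\frac{1}{n}\Big[\mathcal{L}_{(e,d)}(x^n) - \min_{(e',d')\in\mathcal{A}} \mathcal{L}_{(e',d')}(x^n)\Big]\Big\} \le 2^{2/3}\, \tilde{B}^{2/3} B^{1/3} (\log|\mathcal{A}|)^{2/3}\, n^{-1/3} \tag{46}$$

Throughout the above proof, we obtained the following optimized values for $l$ and $\eta$:

$$l = 2\{\log(|\mathcal{A}|)\, n B^2 / \tilde{B}^2\}^{1/3}, \qquad \eta = \{8 \log|\mathcal{A}| / (l \tilde{B}^2 n)\}^{1/2} \tag{47}$$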
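
The optimization over the number of blocks $N$ in the proof can be checked numerically. The short Python sketch below evaluates $\alpha n N^{-1/2} + \beta N$ at the stationary point $N = (\alpha n / 2\beta)^{2/3}$ and compares it with the upper bound $2(\alpha n)^{2/3}\beta^{1/3}$; the numerical values of `alpha`, `beta` and `n` are arbitrary and the function names are hypothetical.

```python
def regret_bound(N, alpha, beta, n):
    """The expression alpha * n * N**(-1/2) + beta * N, i.e. the bound with l = n / N."""
    return alpha * n * N ** -0.5 + beta * N

def optimal_blocks(alpha, beta, n):
    """Stationary point of regret_bound with respect to N."""
    return (alpha * n / (2.0 * beta)) ** (2.0 / 3.0)

# Arbitrary illustrative constants.
alpha, beta, n = 1.5, 2.0, 1_000_000
N_star = optimal_blocks(alpha, beta, n)
value = regret_bound(N_star, alpha, beta, n)
upper = 2.0 * (alpha * n) ** (2.0 / 3.0) * beta ** (1.0 / 3.0)
```

The exact minimum equals $(2^{1/3} + 2^{-2/3})(\alpha n)^{2/3}\beta^{1/3} \approx 1.89\,(\alpha n)^{2/3}\beta^{1/3}$, so the factor 2 in the text is a slight relaxation.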



5 Efficient implementation for sets of scalar source codes 
with variable-rate coding 

In this section, we present an efficient implementation of the scheme described above, for sets of variable-rate scalar source codes. Each of these sets of source codes consists of all pairs $(e, d)$ where $e$ is one of the encoders in some small set of encoders, and $d$ is one of its possible decoders. At the beginning of Section 3, $E$ was defined as some small set of fixed-rate encoders. We now define $\mathcal{H}$ as a small set of binary prefix codes. In this section, our set of encoders is defined to be $E \times \mathcal{H}$, i.e., all the encoders obtained by combining one of the encoders of $E$ with a binary prefix code belonging to $\mathcal{H}$. Generalizing (14), the LC of an expert $(e, d)$ at time $t$ is given by:

^\e,dM) = Yl Mx)PY\x{yW(^,y)^A + ^ Yl Mx)l{e{x)) (48) 
where A, nt(x) and Ib were defined in (14) and l(e(x)) is the length of the codeword



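$b = e(x)$. As in (15), we change to a more convenient form by multiplying the numerator and denominator by the same exponential factor as in (15), and we get:

$$\Pr\{\text{next expert} = (e, d)\} = \frac{\lambda_{(e,d),t} \cdot \gamma_{e,t}}{\sum_{(e',d')\in\mathcal{A}} \lambda_{(e',d'),t} \cdot \gamma_{e',t}} \tag{49}$$

where $\lambda_{(e,d),t}$ is given by (16) and where we define:

$$\gamma_{e,t} = \exp\Big\{-\eta \cdot \delta \sum_{x} n_t(x)\, l(e(x))\Big\} \tag{50}$$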

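As in Section 3, given a set of experts, the random choice of an expert at the beginning of each block is done in two steps. First, we choose an encoder randomly according to:

$$\Pr\{\text{next encoder} = e\} = \frac{\tilde{F}_{e,t}}{\sum_{e' \in E \times \mathcal{H}} \tilde{F}_{e',t}} \tag{51}$$

where $E \times \mathcal{H}$ is the set of encoders, and:

$$\tilde{F}_{e,t} = \sum_{(e,d)\in\mathcal{A}_e} \exp\{-\eta \mathcal{L}_{(e,d)}(x^t)\} = \sum_{(e,d)\in\mathcal{A}_e} \lambda_{(e,d),t} \cdot \gamma_{e,t} = \gamma_{e,t} \cdot \sum_{(e,d)\in\mathcal{A}_e} \lambda_{(e,d),t} = \gamma_{e,t} \cdot F_{e,t} \tag{52}$$

is the sum of the exponential weights of all experts in $\mathcal{A}_e$, where $\mathcal{A}_e$ is the subset of all experts which use the encoder $e$, and $F_{e,t}$ was defined in (18). It was shown in Section 3 that $F_{e,t}$ can be calculated efficiently. $\gamma_{e,t}$ can be calculated directly for each $e$, given that $|E \times \mathcal{H}|$ is reasonably small. In the second step, we choose the decoder randomly exactly as we did before, according to (22). Let us show that the pair $(e, d)$ is indeed chosen according to (49):

$$\Pr\{\text{next expert} = (e, d)\} = \Pr\{\text{next encoder} = e\} \cdot \Pr\{\text{decoder} = d \mid \text{encoder} = e\}$$
$$= \Big(\tilde{F}_{e,t} \Big/ \sum_{e' \in E \times \mathcal{H}} \tilde{F}_{e',t}\Big) \cdot \big(\lambda_{(e,d),t} / F_{e,t}\big) = \gamma_{e,t} \cdot \lambda_{(e,d),t} \Big/ \sum_{e' \in E \times \mathcal{H}} \tilde{F}_{e',t} \tag{53}$$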
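
The factorization in (53) can be checked on a toy example. In the Python sketch below, `lam[e][d]` plays the role of $\lambda_{(e,d),t}$ and `gamma[e]` the role of $\gamma_{e,t}$; the numerical weights, the encoder and decoder labels and the function names are hypothetical.

```python
def two_step_probs(lam, gamma):
    """Probability of each pair (e, d) when e is drawn first, then d given e."""
    F = {e: sum(ds.values()) for e, ds in lam.items()}   # F_e: sum of decoder weights
    F_tilde = {e: gamma[e] * F[e] for e in lam}          # gamma_e * F_e
    Z = sum(F_tilde.values())
    probs = {}
    for e, ds in lam.items():
        p_encoder = F_tilde[e] / Z
        for d, w in ds.items():
            probs[(e, d)] = p_encoder * (w / F[e])
    return probs

def joint_probs(lam, gamma):
    """Direct exponential-weighting probability of each pair (e, d)."""
    Z = sum(gamma[e] * w for e, ds in lam.items() for w in ds.values())
    return {(e, d): gamma[e] * w / Z for e, ds in lam.items() for d, w in ds.items()}

# Hypothetical weights for two encoders with two and three decoders.
lam = {"e1": {"d1": 0.5, "d2": 1.5}, "e2": {"d1": 2.0, "d2": 0.25, "d3": 0.75}}
gamma = {"e1": 0.9, "e2": 0.3}
staged = two_step_probs(lam, gamma)
direct = joint_probs(lam, gamma)
```

Both dictionaries agree entry by entry (up to rounding), which is the content of (53): drawing the encoder first never requires enumerating all pairs.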

We demonstrated an efficient random choice of a pair $(e, d)$. Below is a formal description of the on-line algorithm:

1. Calculate $l$, the optimal length of a data block, according to (47), and let $K = n/l$.

2. Initialize $k$ to 0, and all the weights $\lambda_{x,y,0}$ and $\gamma_{e,0}$ to 1.

3. At the beginning of block no. $k$, update the weights in the following way:

$$\lambda_{x,y,t_k} = \lambda_{x,y,t_{k-1}} \exp\Big\{\eta \sum_{i=(k-1)l+1}^{kl} 1_{\{x_i = x\}} P_{Y|X}(y|x)\Big\}, \qquad t_k = kl + 1, \quad 1 \le k \le K - 1$$

4. For each $(e, z, y)$, calculate the sums:

$$\sum_{x:\, e(x) = z} \lambda_{x,y,t_k}$$

5. Calculate $F_{e,t_k}$, for each $e \in E$, according to (20).

6. Update the $\gamma$'s, for each $e \in E \times \mathcal{H}$, in the following way:

$$\gamma_{e,t_k} = \gamma_{e,t_{k-1}} \exp\Big\{-\eta \cdot \delta \sum_{i=(k-1)l+1}^{kl} l(e(x_i))\Big\}$$

7. Calculate $\tilde{F}_{e,t_k}$, for each $e \in E \times \mathcal{H}$, according to (52).

8. Choose an encoder randomly according to (51).

9. For each pair $(z, y)$, choose the decoder function $d_k$ randomly according to (23).

10. Use the first $\lceil \log(N) \rceil$ bits at the beginning of the $k$-th block to inform the decoder of the identity of $d_k$, chosen in the previous step, where $N$ is the number of experts.

11. Encode the next block using the chosen expert $e_k$:
$b_i = e_k(x_i)$, $kl + \log(N) + 1 \le i \le (k+1)l - 1$

12. If $k < K$, increment $k$ and go to 3.

Notice that in step 10, the lower bound on i is the worst case, as was explained in the proof 
of Theorem 2. The total complexity of the algorithm is 0{n/l ■ + 0{n/l ■ \E x T-L\ ■ 
M) + 0{n/l ■ \E\\X\M) + 0{n). The complexity depends on \E xT-i\, which thus should be 
small as was mentioned above. 

In the following subsection, we first show an efficient scheme which uses an acyclic directed graph and the WPA to implement adaptive Huffman coding. We then use the idea of representing all Huffman codes by a graph to extend the scheme for structured sets of encoders from Subsection 3.1, and build a full LC scheme for the WZ case.

5.1 An efficient adaptive lossless coding scheme using Huffman codes 

In this subsection, we assume that the input alphabet has $M$ symbols and we use $M$ codewords, thus the encoding is lossless. This is a special case of the general LC coding we defined in the previous parts, when $\Delta(x) = 0$ for all $x$. From now on, the prefix codes we use are Huffman codes. Using their structure, we can efficiently implement our coding schemes. A Huffman code will be characterized by a mapping $H_M^{\Lambda}$, $H_M^{\Lambda} : (1, \ldots, M) \to (l_1, \ldots, l_M)$, such that $H_M^{\Lambda}(i) = l_i \in \{1, \ldots, \log(\Lambda)\}$, where $\Lambda = 2^l$ for some positive integer $l < M$. The $l_i$'s represent the lengths of the codewords of some Huffman code with $M$ codewords and maximum codeword length of $\log(\Lambda)$. We call $H_M^{\Lambda}(i)$ the length function of the Huffman code. It is well known that $H_M^{\Lambda}(i)$ is indeed a legitimate length function of some Huffman code if and only if $\sum_{i=1}^{M} 2^{-H_M^{\Lambda}(i)} = 1$. Notice that from our point of view, all Huffman codes with the same length function or, equivalently, the same length set, have the same functionality, and are thus considered as the same code. Given some length set, it is of no importance, of course, which Huffman codebook will actually be used for encoding. Building a Huffman codebook from a length function is straightforward. We will use the scheme described in Theorem 2 for creating the sequential source code.
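
To illustrate the remark that a codebook is easily built from a length function, here is a minimal Python sketch. It uses the standard canonical-code construction, which is one possible choice and not necessarily the one the authors had in mind; the function names are hypothetical and symbols are indexed from 0 for convenience.

```python
from fractions import Fraction

def is_complete_length_set(lengths):
    """True when the Kraft sum equals one exactly (checked with rationals)."""
    return sum(Fraction(1, 2 ** l) for l in lengths) == 1

def canonical_codebook(lengths):
    """Assign prefix-free codewords to symbols 0..M-1 with the given lengths.

    Symbols are processed by increasing length; each codeword is the previous
    one plus one, left-shifted whenever the length grows.
    """
    order = sorted(range(len(lengths)), key=lambda i: (lengths[i], i))
    code, prev_len, book = 0, 0, {}
    for i in order:
        code <<= lengths[i] - prev_len
        book[i] = format(code, "0{}b".format(lengths[i]))
        code += 1
        prev_len = lengths[i]
    return book

book = canonical_codebook([1, 2, 2])
```

For the length function (1, 2, 2) this yields the codewords `0`, `10` and `11`; any other codebook with the same length set would serve equally well for the scheme.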

5.1.1 Definition of the reference set of source codes 

We define $\mathcal{H}_M(\Lambda)$ as the set of all Huffman codes (or equivalently, of all Huffman length sets) with $M$ codewords and maximal length of $\log(\Lambda)$. Our reference set of source codes is $\mathcal{H}_M(\Lambda)$. Each encoder is defined by a mapping $e(i) = b_i$, $i \in \{1, 2, \ldots, M\}$, where $\{b_1, b_2, \ldots, b_M\}$ is some Huffman codebook with length set $\{l(b_1), l(b_2), \ldots, l(b_M)\}$. As was explained, the actual codebook can be chosen arbitrarily among all codebooks which share the same length set. The corresponding decoder is, of course, defined by $d(b_i) = i$, $i \in \{1, 2, \ldots, M\}$.

5.1.2 Graphical representation of all Huffman codes with maximal length log(A) 

Our next step is to reduce the problem of designing our source code (in other words, choosing randomly a code in $\mathcal{H}_M(\Lambda)$, for each $k \in \{1, 2, \ldots, N\}$, given $x^{(k-1)l}$) to the problem of choosing randomly a path on an acyclic directed graph.

We describe each Huffman code as a path $r$ on a graph in the following way (see Fig. 2). We use the following notation:
$\mathcal{V}$ - The set of all vertices:
$\{\frac{1}{\Lambda}, \frac{2}{\Lambda}, \ldots, \frac{\Lambda-1}{\Lambda}\} \times \{1, 2, \ldots, M-1\} \cup (0, 0) \cup (1, M)$
$\mathcal{E}$ - The set of all edges:
$\{((q, j-1), (\hat{q}, j)) : q, \hat{q} \in \{0, \frac{1}{\Lambda}, \frac{2}{\Lambda}, \ldots, 1\},\ \hat{q} > q,\ \log\frac{1}{\hat{q}-q} \in \mathbb{Z}^+,\ j \in \{1, 2, \ldots, M\}\}$
$s$ - The starting point at the bottom left, i.e., $(0, 0)$.
$u$ - The end point at the top right, i.e., $(1, M)$.
$\mathcal{E}_z$ - The set of all edges starting from vertex $z$.

A general graph and the graph for all Huffman codes of order 3 with $\Lambda = 4$, i.e., with maximal length $l = 2$, are described in Fig. 2. The horizontal axis represents the Probability Axis (PA) $[0, 1]$. The vertical axis represents the $M - 1$ choices needed for dividing the PA into $M$ segments. A path composed of the vertices $\{(0, 0), (q_1, 1), \ldots, (q_{M-1}, M-1), (1, M)\}$ represents $M - 1$ consecutive choices of $M - 1$ points $(q_1, \ldots, q_{M-1})$ which divide the PA into $M$ segments, creating a discrete probability distribution with $M$ probabilities. Each edge on a path represents one choice, the choice of the next point on the PA, which defines the next segment. An edge $((q, j-1), (\hat{q}, j))$ corresponds to the segment $[q, \hat{q})$ on the PA, and is thus equivalent to the probability $(\hat{q} - q)$. Each path is thus equivalent to some probability distribution. The correspondence between a probability distribution and a source code is as follows: each probability distribution $\{p_1, p_2, \ldots, p_M\}$ corresponds to the binary prefix code which has the length set $\{\log\frac{1}{p_1}, \log\frac{1}{p_2}, \ldots, \log\frac{1}{p_M}\}$, i.e., to its suitable Shannon code.

Figure 2: The above figure describes a general graph. Below is a directed graph that represents $\mathcal{H}_3(4)$, i.e., all Huffman codes with $M = 3$ codewords and $\Lambda = 4$. The dashed path represents a probability function, which is equivalent to a length set $\{l_1, l_2, l_3\}$. Remember that only edges with $(\hat{q} - q)$ equal to a negative power of 2 are legal.

Therefore, in order to represent all the Huffman codes and only them, we allow only partitions which divide the PA into segments whose lengths are negative integer powers of 2. This is implemented in the following way:

First, we consider only edges $((q, m), (\hat{q}, m+1))$ such that $\log\frac{1}{\hat{q}-q} \in \mathbb{Z}^+$, as mentioned above, meaning that we choose only symbol probabilities of the type $p_m = 2^{-j}$, $j \in \mathbb{Z}^+$, $m \in \{1, 2, \ldots, M\}$. We get $|\mathcal{E}| \le M \Lambda \log(\Lambda)$ edges. It is easy to see that some of the edges do not belong to any full path from $s$ to $u$, and are thus of no use. In order to get rid of the unnecessary edges, we simply start from the edges which end at row $M - 1$, and erase those with no edges starting from their end vertices. Then we move down to row $M - 2$, and so on. This process takes $O(M \Lambda \log(\Lambda))$ time and is done once, off-line. After "cleaning" the graph, we have a graph that contains all the probability functions of the type $P : P(m) = 2^{-j_m}$, $j_m \in \mathbb{Z}^+$, $m \in \{1, 2, \ldots, M\}$, and only them. It is well known that for probabilities of this type, the length set of the corresponding Huffman code is identical to that of the Shannon code and is simply $\{\log\frac{1}{p_1}, \log\frac{1}{p_2}, \ldots, \log\frac{1}{p_M}\}$. Each path corresponds uniquely to a specific Huffman code from the set $\mathcal{H}_M(\Lambda)$ and vice versa, so the graph covers all Huffman codes with different length sets and maximal codeword length of $\log(\Lambda)$, and only them. For each edge $a \in \mathcal{E}$ and time $t$, we assign a weight $\delta_{a,t}$:

$$\delta_{a,t} = \exp\Big\{-\eta f_t(j) \log\frac{1}{\hat{q}-q}\Big\} = \exp\{-\eta f_t(j)\, l(q, \hat{q})\}, \quad a = ((q, j-1), (\hat{q}, j)) \tag{54}$$

where $f_t(j)$ is the empirical frequency of the $j$-th input symbol at time $t$, i.e., the number of times this symbol appears in the input sequence $x^t$. It can be seen from (54) that a weight $\delta_{a,t}$ depends only on the horizontal coordinates of the edge $a$, thus we can denote it as $\delta_{(q,\hat{q}),t}$. The cumulative weight of a path $r = \{(0, 0), (q_1, 1), \ldots, (q_{M-1}, M-1), (1, M)\}$ at time $t$ is defined as the product of its edges' weights:

$$\Lambda_{r,t} = \prod_{a \in r} \delta_{a,t} = \exp\Big\{-\eta \sum_{j=1}^{M} f_t(j)\, l(q_{j-1}, q_j)\Big\} \tag{55}$$

The sum is exactly the cumulative length of the encoding of the string $x^t$, using the Huffman code represented by this path, i.e., the length of $b^t$. Again, following the WPA, we define:

$$G_t(z) = \sum_{r \in \mathcal{R}_z} \prod_{a \in r} \delta_{a,t} \tag{56}$$

where $z$ is a vertex on the graph, $\mathcal{R}_z$ is the set of all paths from $z$ to $u$ and $a$ is an edge on the path $r$. Continuing exactly as in Subsection 3.1.2, we efficiently implement the random choices of codes according to (51).
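
To make the correspondence between paths and length sets concrete, the Python sketch below enumerates, by brute force, all ordered length sets $(l_1, \ldots, l_M)$ whose dyadic probabilities $2^{-l_j}$ use lengths up to $\log(\Lambda)$ and sum to one. This is only an illustration of what the cleaned graph represents; it is not the efficient WPA-based procedure, and the function name is hypothetical.

```python
from fractions import Fraction
from math import log2

def huffman_graph_paths(M, Lam):
    """All ordered length sets of M codewords, each length in 1..log2(Lam),
    whose probabilities 2**-l sum to exactly one (one per s-to-u path)."""
    max_len = int(log2(Lam))
    steps = [(Fraction(1, 2 ** j), j) for j in range(1, max_len + 1)]
    paths = []

    def extend(q, lengths):
        # q is the current point on the probability axis.
        if len(lengths) == M:
            if q == 1:
                paths.append(tuple(lengths))
            return
        for p, j in steps:
            if q + p <= 1:
                extend(q + p, lengths + [j])

    extend(Fraction(0), [])
    return paths

paths_3_4 = huffman_graph_paths(3, 4)
```

For $M = 3$ and $\Lambda = 4$ the enumeration returns the three orderings of the length set {1, 2, 2}, which are the only complete paths in the lower graph of Fig. 2.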




The procedure of updating the weights and finding the next code takes a total of $O(|\mathcal{E}|) + O(M) \le O(M \Lambda \log(\Lambda)) + O(M)$. Now, it is easy to see that it suffices to take $\Lambda = \min(2^{M-1}, n)$ to get all relevant Huffman codes of order $M$ when the sequence length is $n$. The procedure is repeated at the beginning of each data block, giving a total computational complexity of $O(n/l \cdot M \Lambda \log(\Lambda)) + O(nM/l)$. As a special case of the general LC scheme, we get the following result:

Corollary 2.1 Let $\mathcal{H}_M(\Lambda)$ be the set of Huffman codes we defined above. Then there exists a sequential source code $(e, d)$ such that for all $x^n \in \mathcal{X}^n$:

$$E\Big\{\frac{1}{n}\Big[\mathcal{L}_{(e,d)}(x^n) - \min_{(e',d')\in\mathcal{H}_M(\Lambda)} \mathcal{L}_{(e',d')}(x^n)\Big]\Big\} \le M\sqrt{\log|\mathcal{H}_M(\Lambda)|/2}\,(n/l)^{-1/2} \tag{57}$$

Moreover, the scheme can be implemented with computational complexity of $O(n/l \cdot M \Lambda \log(\Lambda)) + O(n/l \cdot M)$.

The proof is similar to that of Theorem 2, with

$$\mathcal{L}_{(e,d)}(x^n) = \sum_{t=1}^{n} l(b_t) \tag{58}$$

and with one additional difference: suppose we use a randomization sequence $\{U_i\}$ of i.i.d. random variables, uniformly distributed on $[0, 1]$, for implementing the random choices used in our WPA. If we assume that the decoder also has access to this sequence, there is no need to inform it of the identity of the encoder. Since the encoding is lossless, the decoder has all the information about the past. Therefore, given the randomization sequence, the decoder can determine the identity of the next source code by itself.

Also notice that in this case, in choosing $l$, there is a trade-off between convergence and the computational complexity. Choosing a small $l$ will improve the upper bound but, on the other hand, will increase the complexity.

In order to get some feeling about the compression performance of this scheme, we give the 
following example. 

Example: if we have $n = 10^{10}$, $M = 256$ and we take $l = \log(n)$, $\Lambda = n$, so that $|\mathcal{H}_M(\Lambda)| \le [\log(n)]^M$, we obtain that the difference between the best static Huffman code for $x^n$ and our scheme is less than 0.3 bit per symbol.
Formal description of the on-line algorithm: Using the set of encoders described above, we now have the following algorithm:

1. Choose $l$, the length of a data block, and let $K = n/l$.

2. Initialize $k$ to 0.

3. Build the encoders graph as described in this section.

4. Initialize all the weights $\delta_{a,0}$ to 1.

5. At the beginning of block no. $k$, i.e., at time $t_k = kl + 1$, $k \in \{1, \ldots, K\}$, calculate $\delta_{(q,\hat{q}),t_k}$ for each pair $(q, \hat{q})$ according to (54).

6. Update the weights of all edges to the new $\delta_{(q,\hat{q}),t_k}$'s.

7. Calculate $G_{t_k}(z)$ recursively, for all $z$, according to (31).

8. Choose the encoder randomly as described in Subsection 3.1.2, using (32).

9. Encode the next block, using the chosen expert $e_k$:
$b_i = e_k(x_i)$, $kl \le i \le (k+1)l - 1$

10. If $k < K$, increment $k$ and go to 5.
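
The block structure of the algorithm can be sketched in Python as follows. To keep the example short, the experts are an explicitly listed set of length functions and the random choice is made by direct exponential weighting over that list, rather than by the graph and the WPA; the input sequence, the parameter values and the function names are hypothetical.

```python
import math
import random

def block_probabilities(counts, experts, eta):
    """Exponential weights over length sets: w = exp(-eta * sum_j counts[j] * l_j)."""
    costs = [sum(c * l for c, l in zip(counts, lens)) for lens in experts]
    base = min(costs)  # subtracting the minimum avoids numerical underflow
    weights = [math.exp(-eta * (c - base)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def adaptive_code_length(xs, experts, block_len, eta, rng):
    """Total number of bits when a length set is redrawn at every block start."""
    counts = [0] * len(experts[0])
    bits = 0
    for start in range(0, len(xs), block_len):
        probs = block_probabilities(counts, experts, eta)
        lens = rng.choices(experts, weights=probs)[0]
        for x in xs[start:start + block_len]:
            bits += lens[x]      # cost of encoding x with the chosen code
            counts[x] += 1       # statistics used at the next block start
    return bits

# Hypothetical input: symbol 0 dominates, so the code (1, 2, 2) is the best one.
experts = [(1, 2, 2), (2, 1, 2), (2, 2, 1)]
xs = [0] * 80 + [1] * 10 + [2] * 10
bits = adaptive_code_length(xs, experts, block_len=10, eta=0.1, rng=random.Random(0))
```

Once the counts reflect the dominance of symbol 0, nearly all of the probability mass sits on the length set (1, 2, 2), so later blocks are encoded almost always with the best static code.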

5.2 An efficient adaptive LC scheme for the WZ case 

In this subsection, we return to the general LC scheme and the reference set of source codes defined in Subsection 4.2. We combine the WZ scheme from Subsection 3.1 and the Huffman coding from the previous subsection into one efficient LC scheme. One interesting special case of the following scheme is described in Appendix A. This special case is obtained by degenerating the side information alphabet into an alphabet of size one.

5.2.1 Definition of the reference set of source codes 

The Input Alphabet Axis (IAA) is defined as the $|\mathcal{X}|$-sized vector $(1, 2, \ldots, |\mathcal{X}|)$. A division of the IAA is given by the $(M-1)$-sized increasing sequence $r = (z_1, \ldots, z_{M-1})$, $z_i \in \{1, 2, \ldots, |\mathcal{X}| - 1\}$. $z_0$ and $z_M$ are defined to be 0 and $|\mathcal{X}|$, respectively. Each combination of a specific division $r$ and a specific Huffman length function defines a specific encoder in the following way:

$$e(x) = b_i : z_{i-1} < Num(x) \le z_i,\ i \in \{1, \ldots, M\},\ b_i \in \{b_i\}_{i=1}^{M},\ \{l(b_i)\}_{i=1}^{M} \in \mathcal{H}_M(\Lambda) \tag{59}$$

We define $E \times \mathcal{H}_M(\Lambda)$ as the set of all encoders which are obtained by such a combination.

5.2.2 Graphical representation of the set of encoders

The random choice of the encoders can be done efficiently using, in this case, a Three-Dimensional (3D) acyclic directed graph instead of a 2D one. We use the following notation:
$\mathcal{V}$ - The set of all vertices:
$\{1, 2, \ldots, |\mathcal{X}| - 1\} \times \{\frac{1}{\Lambda}, \frac{2}{\Lambda}, \ldots, \frac{\Lambda-1}{\Lambda}\} \times \{1, 2, \ldots, M-1\} \cup (0, 0, 0) \cup (|\mathcal{X}|, 1, M)$
$\mathcal{E}$ - The set of all edges:
$\{((z, q, j-1), (\hat{z}, \hat{q}, j)) : z, \hat{z} \in \{0, 1, 2, \ldots, |\mathcal{X}|\},\ q, \hat{q} \in \{0, \frac{1}{\Lambda}, \frac{2}{\Lambda}, \ldots, 1\},\ j \in \{1, 2, \ldots, M\},\ \hat{z} > z,\ \hat{q} > q,\ \log\frac{1}{\hat{q}-q} \in \mathbb{Z}^+\}$
$s$ - The starting point in the bottom left, i.e., $(0, 0, 0)$.
$u$ - The end point in the top right, i.e., $(|\mathcal{X}|, 1, M)$.
$\mathcal{E}_z$ - The set of all edges starting from vertex $z$.

Figure 3: The lower graph presents the PA-IAA plane for $|\mathcal{X}| = 6$, $M = 3$ and $\Lambda = 4$. The dashed path represents a probability distribution attached to the input alphabet subsets $\{\{1, 2, 3\}, \{4, 5\}, \{6\}\}$, respectively. The other path represents a probability distribution attached to the partition $\{\{1, 2\}, \{3, 4, 5\}, \{6\}\}$. The upper graph shows the same paths on the IAA - vertical axis plane. Remember that the vertical axis represents the $M - 1$ consecutive decisions needed for defining an encoder.

The IAA represents the ordered input alphabet. The PA represents the probability axis $[0, 1]$. The vertical axis represents the $M - 1$ choices needed for dividing the IAA and the PA simultaneously, each one into $M$ segments. A path composed of the vertices $\{(0, 0, 0), (z_1, q_1, 1), \ldots, (z_{M-1}, q_{M-1}, M-1), (|\mathcal{X}|, 1, M)\}$ represents $M - 1$ consecutive choices of $M - 1$ points $([z_1, q_1], \ldots, [z_{M-1}, q_{M-1}])$ on the PA-IAA plane, which divide the IAA and the PA into $M$ segments, creating $M$ subsets of the input alphabet and $M$ probabilities. Each edge on a path represents one choice, the choice of the next point on the horizontal subspace, which defines the next segment on the IAA and the next point on the PA. An edge $((z, q, j-1), (\hat{z}, \hat{q}, j))$ corresponds to the segment $(z, \hat{z}]$ on the IAA and to the segment $(q, \hat{q}]$ on the PA, and is thus equivalent to the subset $\{x : z < x \le \hat{z}\}$ when assigned the probability $(\hat{q} - q)$. Therefore, a path from $s$ to $u$, having $M$ edges, defines a partition of the input alphabet into $M$ subsets, each subset being assigned a probability. Each probability $(\hat{q} - q)$ is equivalent to the length $\log\frac{1}{\hat{q}-q}$, which is the length of a codeword in a real Huffman code, as was explained in detail in Subsection 5.1. Therefore, each path defines a specific encoder, which implements the alphabet division $r = (z_1, \ldots, z_{M-1})$ and uses a Huffman code with the length set $\{\log\frac{1}{q_1 - q_0}, \ldots, \log\frac{1}{q_M - q_{M-1}}\}$ (with $q_0 = 0$ and $q_M = 1$) for the variable-rate coding part, where the lengths are assigned to the subsets respectively. As in the case with no distortion, we "clean" the graph off-line from edges that are of no use. This is done in the same way described in Subsection 5.1.2. After that, we end up with $|\mathcal{E}| \le M |\mathcal{X}|^2 \Lambda \cdot \log(\Lambda)$. An example of a PA-IAA plane is presented in Fig. 3. For each edge $a \in \mathcal{E}$ and time $t$, we assign a weight

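$$\delta_{a,t} = \delta_{(z,\hat{z}),t} \cdot \exp\{-\eta \cdot \delta \cdot f_t(z, \hat{z})\, l(q, \hat{q})\}, \quad a = ((z, q, j-1), (\hat{z}, \hat{q}, j)) \tag{60}$$

where $\delta_{(z,\hat{z}),t}$ is given by (26), and $f_t(z, \hat{z})$ is given by:

$$f_t(z, \hat{z}) = \sum_{i=1}^{t} 1_{\{x_i \in (z, \hat{z}]\}} \tag{61}$$

which is the empirical frequency of the subset $\{x : z < x \le \hat{z}\}$ in the input sequence $x^t$. The cumulative weight of a path $r = \{(0, 0, 0), (z_1, q_1, 1), \ldots, (z_{M-1}, q_{M-1}, M-1), (|\mathcal{X}|, 1, M)\}$ at time $t$ is the product of its edges' weights:

$$\Lambda_{r,t} = \prod_{a \in r} \delta_{a,t} \tag{62}$$

$\Lambda_{r,t}$ is simply $\tilde{F}_{e,t}$:

$$\Lambda_{r,t} = \prod_{a \in r} \delta_{a,t} = \prod_{m=1}^{M} \big\{\delta_{(z_{m-1},z_m),t} \cdot \exp\{-\eta \cdot \delta \cdot f_t(z_{m-1}, z_m)\, l(q_{m-1}, q_m)\}\big\}$$
$$= \prod_{m=1}^{M} \delta_{(z_{m-1},z_m),t} \prod_{m=1}^{M} \exp\{-\eta \cdot \delta \cdot f_t(z_{m-1}, z_m)\, l(q_{m-1}, q_m)\} = F_{e,t} \cdot \gamma_{e,t} \tag{63}$$

where the last equality follows from (28) and the definition of $\gamma_{e,t}$. Following the WPA exactly as we did in the previous parts, we efficiently implement the random choice of the encoder. The random choice of one of the possible decoders, given the encoder, remains the same, as was shown before.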

Formal description of the on-line algorithm: Using the set of encoders described above, we now have the following algorithm:

1. Calculate $l$, the optimal length of a data block, according to (47), and let $K = n/l$.

2. Initialize $k$ to 0, and all the weights $\lambda_{x,y,0}$ to 1.

3. Build the encoders graph as described in this section.

4. Initialize all the weights $\delta_{a,0}$ to 1.

5. At the beginning of block no. $k$, i.e., at time $t_k = kl + 1$, update the weights in the following way:

$$\lambda_{x,y,t_k} = \lambda_{x,y,t_{k-1}} \exp\Big(\eta \sum_{i=(k-1)l+1}^{kl} 1_{\{x_i = x\}} P_{Y|X}(y|x)\Big)$$

6. At the beginning of block no. $k$, calculate $\delta_{(z,\hat{z}),t_k}$ and $f_{t_k}(z, \hat{z})$ for each pair $(z, \hat{z})$ according to (26) and (61), respectively.

7. Update the weights of all edges to the new $\delta_{a,t_k}$'s according to (60).

8. Calculate $G_{t_k}(z)$ recursively, for all $z$, according to (31).

9. Choose the encoder randomly as described in Subsection 3.1.2, using (32).

10. For each pair $(z, y)$, choose the decoder function $d_k$ randomly according to (23).

11. Use the first $\lceil \log(N) \rceil$ bits at the beginning of the $k$-th block to inform the decoder of the identity of $d_k$, chosen in the previous step, where $N$ is the number of experts.

12. Encode the next block using the chosen expert $e_k$:
$b_i = e_k(x_i)$, $kl + \log(N) + 1 \le i \le (k+1)l - 1$

13. If $k < K$, increment $k$ and go to 3.

The total complexity of the algorithm is 0{n/l-\X\^)+0{n/l-M\X\'^X-log{X))+0{n/l\?!:\'^) + 
0{n). 
5.3 General distortion measures



Let $\rho(x, \hat{x})$ be some bounded distortion measure. Given an encoder $e$, we define:

$$\lambda_{x,y,e,t} = \exp\Big\{-\eta \sum_{\{x' : x' \ne x,\ e(x) = e(x')\}} n_t(x') P_{Y|X}(y|x')\, \rho(x, x')\Big\} \tag{64}$$

It is easy to see that, given some possible decoder $d$, the exponential weight of the pair $r = (e, d)$ is:

$$\lambda_{(e,d),t} = \prod_{z=1}^{M} \prod_{y} \lambda_{d(z,y),y,e,t} \tag{65}$$

Using the generalized $\lambda$'s we defined, we can continue exactly as in the Hamming case. Each generalized $\lambda_{x,y,e,t}$ contains an exponent of a sum of $O(|\mathcal{X}|)$ products. So, given an encoder, the complexity is increased by a factor of $|\mathcal{X}|$. An example of using a general distortion measure is in the next subsection.

5.4 Variable-Rate coding - Quantizers with Huffman codes

In this subsection, we describe a special case of the variable-rate coding scheme of Section 5.2. The ordered input alphabet is now composed of points on the real axis, $\mathcal{X} = \{0, \frac{1}{K}, \ldots, 1\}$, where $K \ge 2$ is some positive integer. It is easy to see that $|\mathcal{X}| = K + 1$. In this part, we assume there is no side information or, equivalently, that the side information alphabet is of size one. As described in Subsection 5.2, each encoder partitions the input alphabet into $M$ subsets and uses some Huffman code for the lossless coding part. We use some general bounded distortion measure which satisfies:

$$\rho(x, \hat{x}) = \rho(|x - \hat{x}|) \tag{66}$$

Under these conditions, our set is actually a set of quantizers of size $M$, where the points of each quantizer are encoded with some Huffman code. A source code $(e, d)$ is called a Nearest-Neighbor (NN) quantizer if for all $x$ it satisfies:

$$\rho(|x - d(e(x))|) = \min_{x' \in \hat{\mathcal{X}}} \rho(|x - x'|) \tag{67}$$

where $\hat{\mathcal{X}}$ denotes the set of reproduction points of the quantizer. By definition, the distortion of an NN quantizer is always the minimal among all quantizers with the same points. It is easy to see that all the possible NN quantizers for this case are included in our reference set of source codes. The cumulative LC is:

$$\mathcal{L}_{(e,d)}(x^n) = \sum_{t=1}^{n} C(x_t) = \sum_{t=1}^{n} \rho(|x_t - d(e(x_t))|) + \delta \sum_{t=1}^{n} l(e(x_t)) \tag{68}$$

We call our set of encoders $Q \times \mathcal{H}$. We build an acyclic directed graph according to the description in Subsection 5.2. For each edge $a \in \mathcal{E}$ and time $t$, we assign a weight $\delta_{a,t}$, where in $\delta_{(z,\hat{z}),t}$ we substitute the generalized $\lambda$'s defined in (64):

$$\lambda_{x,(z,\hat{z}],t} = \exp\Big\{-\eta \sum_{\{x' : x' \ne x,\ x' \in (z, \hat{z}]\}} n_t(x')\, \rho(x, x')\Big\} \tag{69}$$

Notice that the dependency on $y$ was omitted, and that the dependency on $e$ was replaced by $(z, \hat{z}]$. This stems from the fact that the subset $\{x' : e(x) = e(x')\}$ in (64) is equal to the subset $\{x' : x' \in (z, \hat{z}]\}$ in our case, by the definition of our graph. After choosing an encoder, choosing a decoder is done according to (23) where, again, we use the generalized $\lambda$'s, and the dependency on $y$ is omitted. The complexity of the algorithm remains the same as in Subsection 5.2.
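
The quantizer setting can be illustrated with a small Python sketch that evaluates the cumulative LC of (68) and tests the nearest-neighbor property of (67). The alphabet with $K = 4$, the cell boundaries, the reproduction points, the squared-error choice of $\rho$ and the function names are all hypothetical choices made for the example.

```python
def cell_index(x, boundaries):
    """Index of the cell (z, z_hat] containing x; boundaries are the upper ends."""
    for i, upper in enumerate(boundaries):
        if x <= upper:
            return i
    raise ValueError("x lies above the last boundary")

def quantizer_lc(xs, boundaries, points, lengths, delta, rho):
    """Cumulative LC: sum of rho(|x - d(e(x))|) plus delta times the code length."""
    total = 0.0
    for x in xs:
        i = cell_index(x, boundaries)
        total += rho(abs(x - points[i])) + delta * lengths[i]
    return total

def is_nearest_neighbor(alphabet, boundaries, points, rho):
    """True when every x is mapped to a reproduction point of minimal distortion."""
    return all(
        rho(abs(x - points[cell_index(x, boundaries)]))
        <= min(rho(abs(x - p)) for p in points) + 1e-12
        for x in alphabet
    )

K = 4
alphabet = [i / K for i in range(K + 1)]   # {0, 1/4, 1/2, 3/4, 1}
boundaries = [0.25, 0.75, 1.0]             # cells {0, 1/4}, {1/2, 3/4}, {1}
points = [0.0, 0.5, 1.0]                   # one reproduction point per cell
lengths = [2, 1, 2]                        # a complete (Huffman) length set
squared = lambda d: d * d
cost = quantizer_lc(alphabet, boundaries, points, lengths, 0.1, squared)
```

With these choices every alphabet point is at distance at most 1/4 from its reproduction point, so the quantizer is nearest-neighbor, and `cost` equals 0.125 of distortion plus 0.1 times 8 bits.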